* [PATCH v4 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
@ 2021-04-19  0:00 ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

 v4:
  - Drop v3 patch 1 as well as the modification to mm/percpu.c as the
    percpu vmstat update isn't frequent enough to be worth caching.
  - Add a new patch 1 to move mod_objcg_state() to memcontrol.c instead.
  - Combine v3 patches 4 & 5 into a single patch (patch 3).
  - Add a new patch 4 to cache both reclaimable & unreclaimable vmstat updates.
  - Add a new patch 5 to improve refill_obj_stock() performance.

 v3:
  - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
    in patch 3.
  - Remove redundant current_obj_stock() call in patch 5.

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for separate kmemcaches for each memory cgroup and
reduce overall kernel memory usage. However, we also add memory
accounting overhead to each call of kmem_cache_alloc() and
kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and
de-allocations may experience a performance regression, as illustrated
in [1] and [2].

A simple kernel module that performs a loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on either a small
32-byte object or a big 4k object at module init time is used for
benchmarking. The test was run on a Cascade Lake server with turbo
boosting disabled to reduce run-to-run variation.
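
Below is a minimal sketch of such a benchmark module (hypothetical; the
actual test module used for these numbers is not part of this series,
and the use of SLAB_ACCOUNT here is an assumption about how accounting
was enabled for the cache). It times NR_LOOPS alloc/free pairs from a
dedicated kmem_cache at module init time:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define NR_LOOPS	100000000UL
#define OBJ_SIZE	32	/* or 4096 for the large object test */

static int __init kmem_bench_init(void)
{
	struct kmem_cache *cachep;
	ktime_t start;
	unsigned long i;
	void *obj;

	/* SLAB_ACCOUNT makes allocations from this cache memcg-accounted */
	cachep = kmem_cache_create("kmem_bench", OBJ_SIZE, 0,
				   SLAB_ACCOUNT, NULL);
	if (!cachep)
		return -ENOMEM;

	start = ktime_get();
	for (i = 0; i < NR_LOOPS; i++) {
		obj = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cachep, obj);
	}
	pr_info("kmem_bench: %lu loops in %lld ns\n", i,
		ktime_to_ns(ktime_sub(ktime_get(), start)));

	kmem_cache_destroy(cachep);
	return 0;
}

static void __exit kmem_bench_exit(void)
{
}

module_init(kmem_bench_init);
module_exit(kmem_bench_exit);
MODULE_LICENSE("GPL");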

With memory accounting disabled, the run time was 2.848s for the small
object and 2.890s for the big object. With memory accounting enabled,
the run times with various subsets of the patchset applied were:

  Applied patches   Run time   Accounting overhead   %age 1   %age 2
  ---------------   --------   -------------------   ------   ------

  Small 32-byte object:
       None          10.570s         7.722s          100.0%   271.1%
        1-2           8.560s         5.712s           74.0%   200.6%
        1-3           6.592s         3.744s           48.5%   131.5%
        1-4           7.154s         4.306s           55.8%   151.2%
        1-5           7.192s         4.344s           56.3%   152.5%

  Large 4k object:
       None          20.612s        17.722s          100.0%   613.2%
        1-2          20.354s        17.464s           98.5%   604.3%
        1-3          19.395s        16.505s           93.1%   571.1%
        1-4          19.094s        16.204s           91.4%   560.7%
        1-5          13.576s        10.686s           60.3%   369.8%

  N.B. %age 1 = overhead/unpatched overhead
       %age 2 = overhead/accounting-disabled run time
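
As a worked example, for the unpatched ("None") small object row: the
accounting overhead is 10.570s - 2.848s = 7.722s, so %age 1 =
7.722s/7.722s = 100.0% and %age 2 = 7.722s/2.848s = 271.1%.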

The small object test mainly exercises the object stock charging and
vmstat update code paths. The large object test also exercises the
refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge()
code paths.

The vmstat data stock caching helps in the small object test, but
not so much in the large object test. Similarly, eliminating
irq_disable/irq_enable helps in the small object test and less so in
the large object test. Caching both reclaimable and non-reclaimable
vmstat data actually regresses performance a bit in this particular
small object test.

The final patch to optimize refill_obj_stock() has negligible impact
on the small object test as this code path isn't being exercised. The
large object test, however, sees a substantial performance improvement
with this patch (relative overhead drops from 91.4% to 60.3%).

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Move mod_objcg_state() to memcontrol.c
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Optimize user context object stock access
  mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
  mm/memcg: Improve refill_obj_stock() performance

 mm/memcontrol.c | 199 +++++++++++++++++++++++++++++++++++++++++-------
 mm/slab.h       |  16 +---
 2 files changed, 175 insertions(+), 40 deletions(-)

-- 
2.18.1


* [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19  0:00   ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 13 +++++++++++++
 mm/slab.h       | 16 ++--------------
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+		     enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..ae8b85875426 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+		     enum node_stat_item idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      gfp_t flags, size_t size,
-- 
2.18.1



* [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-19  0:00 ` Waiman Long
  2021-04-19 16:38     ` Johannes Weiner
  -1 siblings, 1 reply; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache the byte charge for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
caches the vmstat data in the memcg_stock_pcp structure as well, until
it accumulates a page's worth of updates or until other cached data
change. Caching the vmstat data in the per-cpu stock replaces two
writes to non-hot cachelines (for the memcg-specific as well as the
memcg-lruvec-specific vmstat data) with a write to a hot local stock
cacheline.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, about 20% (634400 out of 3243830) of the calls to
mod_objcg_state() after initial boot led to an actual call to
__mod_objcg_state(). During a parallel kernel build, the figure was
about 17% (24329265 out of 142512465). So caching the vmstat data
reduces the number of calls to __mod_objcg_state() by more than 80%.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc9032f28f2e..693453f95d99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
-		     enum node_stat_item idx, int nr)
+static inline void __mod_objcg_state(struct obj_cgroup *objcg,
+				     struct pglist_data *pgdat,
+				     enum node_stat_item idx, int nr)
 {
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec = NULL;
@@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	__mod_memcg_lruvec_state(lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+		     enum node_stat_item idx, int nr)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	stock = this_cpu_ptr(&memcg_stock);
+
+	/*
+	 * Save vmstat data in stock and skip vmstat array update unless
+	 * accumulating over a page of vmstat data or when pgdat or idx
+	 * changes.
+	 */
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		__mod_objcg_state(objcg, pgdat, idx, nr);
+
+	local_irq_restore(flags);
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	/*
+	 * Flush the vmstat data in current stock
+	 */
+	if (stock->vmstat_bytes) {
+		__mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				  stock->vmstat_bytes);
+		stock->cached_pgdat = NULL;
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
-- 
2.18.1



* [PATCH v4 3/5] mm/memcg: Optimize user context object stock access
@ 2021-04-19  0:00 ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two sets of object stocks (in the new obj_stock structure)
for task context and interrupt context access respectively.

The task context object stock can be accessed after disabling
preemption, which is cheap in a non-preempt kernel. The interrupt
context object stock can only be accessed after disabling interrupts.
User context code can access the interrupt object stock, but not vice
versa.

The downside of this change is that more data are stored in local
object stocks and are not reflected in the charge counter and the
vmstat arrays. However, this is a small price to pay for better
performance.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 94 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 68 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 693453f95d99..c13502eab282 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2207,17 +2207,23 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2227,12 +2233,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2242,6 +2248,40 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+{
+	struct memcg_stock_pcp *stock;
+
+	if (likely(in_task())) {
+		preempt_disable();
+		stock = this_cpu_ptr(&memcg_stock);
+		return &stock->task_obj;
+	} else {
+		local_irq_save(*pflags);
+		stock = this_cpu_ptr(&memcg_stock);
+		return &stock->irq_obj;
+	}
+}
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (likely(in_task()))
+		preempt_enable();
+	else
+		local_irq_restore(flags);
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2308,7 +2348,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3153,6 +3195,10 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+/*
+ * __mod_objcg_state() may be called with irq enabled, so
+ * mod_memcg_lruvec_state() should be used.
+ */
 static inline void __mod_objcg_state(struct obj_cgroup *objcg,
 				     struct pglist_data *pgdat,
 				     enum node_stat_item idx, int nr)
@@ -3163,18 +3209,15 @@ static inline void __mod_objcg_state(struct obj_cgroup *objcg,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(lruvec, idx, nr);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
-	struct memcg_stock_pcp *stock;
 	unsigned long flags;
-
-	local_irq_save(flags);
-	stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = get_obj_stock(&flags);
 
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
@@ -3207,29 +3250,26 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		__mod_objcg_state(objcg, pgdat, idx, nr);
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
 	unsigned long flags;
+	struct obj_stock *stock = get_obj_stock(&flags);
 	bool ret = false;
 
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3280,8 +3320,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (in_task() && stock->task_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+	if (stock->irq_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3291,12 +3336,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
 	unsigned long flags;
+	struct obj_stock *stock = get_obj_stock(&flags);
 
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3308,7 +3350,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
-- 
2.18.1



* [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
@ 2021-04-19  0:00   ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

Currently, the object stock caches either reclaimable vmstat bytes or
unreclaimable vmstat bytes, but not both. The hit rate can be improved
if both types of vmstat data can be cached, especially on a single-node
system.

This patch supports the caching of both types of vmstat data, though
at the expense of slightly increased complexity in the caching code.
For large objects (>= PAGE_SIZE), the vmstat array is updated directly
without going through the stock caching step.

On a 2-socket Cascade Lake server with instrumentation enabled, the
miss rates are shown in the table below.

  Initial bootup:

  Kernel       __mod_objcg_state    mod_objcg_state    %age
  ------       -----------------    ---------------    ----
  Before patch      634400              3243830        19.6%
  After patch       419810              3182424        13.2%

  Parallel kernel build:

  Kernel       __mod_objcg_state    mod_objcg_state    %age
  ------       -----------------    ---------------    ----
  Before patch      24329265           142512465       17.1%
  After patch       24051721           142445825       16.9%

There was a decrease in the miss rate after initial system bootup.
However, the miss rate for the parallel kernel build remained about the
same, probably because most of the touched kmemcache objects were
reclaimable inodes and dentries.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c13502eab282..a6dd18f6d8a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2212,8 +2212,8 @@ struct obj_stock {
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
-	int vmstat_idx;
-	int vmstat_bytes;
+	int reclaimable_bytes;		/* NR_SLAB_RECLAIMABLE_B */
+	int unreclaimable_bytes;	/* NR_SLAB_UNRECLAIMABLE_B */
 #else
 	int dummy[0];
 #endif
@@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
+	int *bytes, *alt_bytes, alt_idx;
+
+	/*
+	 * Directly update vmstat array for big object.
+	 */
+	if (unlikely(abs(nr) >= PAGE_SIZE))
+		goto update_vmstat;
+
+	stock = get_obj_stock(&flags);
+	if (idx == NR_SLAB_RECLAIMABLE_B) {
+		bytes = &stock->reclaimable_bytes;
+		alt_bytes = &stock->unreclaimable_bytes;
+		alt_idx = NR_SLAB_UNRECLAIMABLE_B;
+	} else {
+		bytes = &stock->unreclaimable_bytes;
+		alt_bytes = &stock->reclaimable_bytes;
+		alt_idx = NR_SLAB_RECLAIMABLE_B;
+	}
 
 	/*
-	 * Save vmstat data in stock and skip vmstat array update unless
-	 * accumulating over a page of vmstat data or when pgdat or idx
+	 * Try to save vmstat data in stock and skip vmstat array update
+	 * unless accumulating over a page of vmstat data or when pgdat
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-	} else if (!stock->vmstat_bytes) {
-		/* Save the current data */
-		stock->vmstat_bytes = nr;
-		stock->vmstat_idx = idx;
-		stock->cached_pgdat = pgdat;
-		nr = 0;
-	} else if ((stock->cached_pgdat != pgdat) ||
-		   (stock->vmstat_idx != idx)) {
-		/* Output the cached data & save the current data */
-		swap(nr, stock->vmstat_bytes);
-		swap(idx, stock->vmstat_idx);
+	} else if (stock->cached_pgdat != pgdat) {
+		/* Save the current data and output cached data, if any */
+		swap(nr, *bytes);
 		swap(pgdat, stock->cached_pgdat);
+		if (*alt_bytes) {
+			__mod_objcg_state(objcg, pgdat, alt_idx,
+					  *alt_bytes);
+			*alt_bytes = 0;
+		}
 	} else {
-		stock->vmstat_bytes += nr;
-		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
-			nr = stock->vmstat_bytes;
-			stock->vmstat_bytes = 0;
+		*bytes += nr;
+		if (abs(*bytes) > PAGE_SIZE) {
+			nr = *bytes;
+			*bytes = 0;
 		} else {
 			nr = 0;
 		}
 	}
-	if (nr)
-		__mod_objcg_state(objcg, pgdat, idx, nr);
-
 	put_obj_stock(flags);
+	if (!nr)
+		return;
+update_vmstat:
+	__mod_objcg_state(objcg, pgdat, idx, nr);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
 	/*
 	 * Flush the vmstat data in current stock
 	 */
-	if (stock->vmstat_bytes) {
-		__mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
-				  stock->vmstat_bytes);
+	if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
+		int bytes;
+
+		if ((bytes = stock->reclaimable_bytes))
+			__mod_objcg_state(old, stock->cached_pgdat,
+					  NR_SLAB_RECLAIMABLE_B, bytes);
+		if ((bytes = stock->unreclaimable_bytes))
+			__mod_objcg_state(old, stock->cached_pgdat,
+					  NR_SLAB_UNRECLAIMABLE_B, bytes);
+
 		stock->cached_pgdat = NULL;
-		stock->vmstat_bytes = 0;
-		stock->vmstat_idx = 0;
+		stock->reclaimable_bytes = 0;
+		stock->unreclaimable_bytes = 0;
 	}
 
 	obj_cgroup_put(old);
-- 
2.18.1



* [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
@ 2021-04-19  0:00   ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19  0:00 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox, Waiman Long

There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out the remaining bytes to the obj_cgroup, clear
cached_objcg and do an obj_cgroup_put(). It is likely that the same
obj_cgroup will be used again, which leads to another drain_obj_stock()
and obj_cgroup_get() as well as atomically retrieving the available
bytes from the obj_cgroup. That is costly. Instead, we should just
uncharge the excess pages, reduce the stock bytes and be done with it.
The drain_obj_stock() function should only be called when the
obj_cgroup changes.

Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case, a
new overfill flag is added to refill_obj_stock() which is set when it
is called from obj_cgroup_charge().

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+			     bool overfill)
 {
 	unsigned long flags;
 	struct obj_stock *stock = get_obj_stock(&flags);
+	unsigned int nr_pages = 0;
 
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		if (stock->cached_objcg)
+			drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
 	}
 	stock->nr_bytes += nr_bytes;
 
-	if (stock->nr_bytes > PAGE_SIZE)
-		drain_obj_stock(stock);
+	if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		stock->nr_bytes &= (PAGE_SIZE - 1);
+	}
 
 	put_obj_stock(flags);
+
+	if (nr_pages) {
+		rcu_read_lock();
+		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+		rcu_read_unlock();
+	}
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 
 	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
 	if (!ret && nr_bytes)
-		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
 
 	css_put(&memcg->css);
 	return ret;
@@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-	refill_obj_stock(objcg, size);
+	refill_obj_stock(objcg, size, false);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1



* Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
@ 2021-04-19  6:06     ` Muchun Song
  -1 siblings, 0 replies; 49+ messages in thread
From: Muchun Song @ 2021-04-19  6:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux Memory Management List, Shakeel Butt, Alex Shi, Chris Down,
	Yafang Shao, Wei Yang, Masayoshi Mizuma, Xing Zhengjun,
	Matthew Wilcox

On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <longman@redhat.com> wrote:
>
> There are two issues with the current refill_obj_stock() code. First of
> all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
> atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
> and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
> be used again which leads to another call to drain_obj_stock() and
> obj_cgroup_get() as well as atomically retrieve the available byte from
> obj_cgroup. That is costly. Instead, we should just uncharge the excess
> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> function should only be called when obj_cgroup changes.
>
> Secondly, when charging an object of size not less than a page in
> obj_cgroup_charge(), it is possible that the remaining bytes to be
> refilled to the stock will overflow a page and cause refill_obj_stock()
> to uncharge 1 page. To avoid the additional uncharge in this case,
> a new overfill flag is added to refill_obj_stock() which will be set
> when called from obj_cgroup_charge().
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6dd18f6d8a8..d13961352eef 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>         return false;
>  }
>
> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> +                            bool overfill)
>  {
>         unsigned long flags;
>         struct obj_stock *stock = get_obj_stock(&flags);
> +       unsigned int nr_pages = 0;
>
>         if (stock->cached_objcg != objcg) { /* reset if necessary */
> -               drain_obj_stock(stock);
> +               if (stock->cached_objcg)
> +                       drain_obj_stock(stock);
>                 obj_cgroup_get(objcg);
>                 stock->cached_objcg = objcg;
>                 stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>         }
>         stock->nr_bytes += nr_bytes;
>
> -       if (stock->nr_bytes > PAGE_SIZE)
> -               drain_obj_stock(stock);
> +       if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> +               nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> +               stock->nr_bytes &= (PAGE_SIZE - 1);
> +       }
>
>         put_obj_stock(flags);
> +
> +       if (nr_pages) {
> +               rcu_read_lock();
> +               __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +               rcu_read_unlock();
> +       }

It is not safe to call __memcg_kmem_uncharge() under the RCU lock
without holding a reference to the memcg. More details can be found
at the following link:

https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuchun@bytedance.com/

In the above patchset, we introduce obj_cgroup_uncharge_pages() to
uncharge pages from an object cgroup. You can use this safe API
instead.
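
For reference, a minimal sketch of that kind of helper (the exact names
and details come from the linked series, not from this patch, so treat
them as assumptions) pins the memcg with a css reference before
uncharging instead of relying on the RCU read lock alone:

static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
				      unsigned int nr_pages)
{
	struct mem_cgroup *memcg;

	/* takes a css reference on the memcg under the RCU read lock */
	memcg = get_mem_cgroup_from_objcg(objcg);
	__memcg_kmem_uncharge(memcg, nr_pages);
	css_put(&memcg->css);
}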

Thanks.

>  }
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
>         ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
>         if (!ret && nr_bytes)
> -               refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> +               refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
>
>         css_put(&memcg->css);
>         return ret;
> @@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
>  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  {
> -       refill_obj_stock(objcg, size);
> +       refill_obj_stock(objcg, size, false);
>  }
>
>  #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
@ 2021-04-19  6:06     ` Muchun Song
  0 siblings, 0 replies; 49+ messages in thread
From: Muchun Song @ 2021-04-19  6:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux Memory Management List, Shakeel Butt, Alex Shi, Chris Down,
	Yafang Shao, Wei Yang, Masayoshi Mizuma, Xing Zhengjun,
	Matthew Wilcox

On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <longman@redhat.com> wrote:
>
> There are two issues with the current refill_obj_stock() code. First of
> all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
> atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
> and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
> be used again which leads to another call to drain_obj_stock() and
> obj_cgroup_get() as well as atomically retrieve the available byte from
> obj_cgroup. That is costly. Instead, we should just uncharge the excess
> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> function should only be called when obj_cgroup changes.
>
> Secondly, when charging an object of size not less than a page in
> obj_cgroup_charge(), it is possible that the remaining bytes to be
> refilled to the stock will overflow a page and cause refill_obj_stock()
> to uncharge 1 page. To avoid the additional uncharge in this case,
> a new overfill flag is added to refill_obj_stock() which will be set
> when called from obj_cgroup_charge().
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6dd18f6d8a8..d13961352eef 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>         return false;
>  }
>
> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> +                            bool overfill)
>  {
>         unsigned long flags;
>         struct obj_stock *stock = get_obj_stock(&flags);
> +       unsigned int nr_pages = 0;
>
>         if (stock->cached_objcg != objcg) { /* reset if necessary */
> -               drain_obj_stock(stock);
> +               if (stock->cached_objcg)
> +                       drain_obj_stock(stock);
>                 obj_cgroup_get(objcg);
>                 stock->cached_objcg = objcg;
>                 stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>         }
>         stock->nr_bytes += nr_bytes;
>
> -       if (stock->nr_bytes > PAGE_SIZE)
> -               drain_obj_stock(stock);
> +       if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> +               nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> +               stock->nr_bytes &= (PAGE_SIZE - 1);
> +       }
>
>         put_obj_stock(flags);
> +
> +       if (nr_pages) {
> +               rcu_read_lock();
> +               __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +               rcu_read_unlock();
> +       }

It is not safe to call __memcg_kmem_uncharge() under rcu lock
and without holding a reference to memcg. More details can refer
to the following link.

https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuchun@bytedance.com/

In the above patchset, we introduce obj_cgroup_uncharge_pages to
uncharge some pages from object cgroup. You can use this safe
API.

Thanks.

>  }
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
>         ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
>         if (!ret && nr_bytes)
> -               refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> +               refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
>
>         css_put(&memcg->css);
>         return ret;
> @@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
>  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  {
> -       refill_obj_stock(objcg, size);
> +       refill_obj_stock(objcg, size, false);
>  }
>
>  #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
  2021-04-19  6:06     ` Muchun Song
  (?)
@ 2021-04-19 15:00       ` Shakeel Butt
  -1 siblings, 0 replies; 49+ messages in thread
From: Shakeel Butt @ 2021-04-19 15:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Tejun Heo, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	LKML, Cgroups, Linux Memory Management List, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Matthew Wilcox

On Sun, Apr 18, 2021 at 11:07 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <longman@redhat.com> wrote:
> >
> > There are two issues with the current refill_obj_stock() code. First of
> > all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
> > atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg
> > and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
> > be used again, which leads to another call to drain_obj_stock() and
> > obj_cgroup_get() as well as atomically retrieving the available bytes from
> > obj_cgroup. That is costly. Instead, we should just uncharge the excess
> > pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> > function should only be called when the obj_cgroup changes.
> >
> > Secondly, when charging an object at least a page in size in
> > obj_cgroup_charge(), it is possible that the remaining bytes to be
> > refilled to the stock will overflow a page and cause refill_obj_stock()
> > to uncharge one page. To avoid the additional uncharge in this case,
> > a new overfill flag is added to refill_obj_stock() which is set
> > when it is called from obj_cgroup_charge().
> >
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> >  mm/memcontrol.c | 23 +++++++++++++++++------
> >  1 file changed, 17 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a6dd18f6d8a8..d13961352eef 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> >         return false;
> >  }
> >
> > -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > +                            bool overfill)
> >  {
> >         unsigned long flags;
> >         struct obj_stock *stock = get_obj_stock(&flags);
> > +       unsigned int nr_pages = 0;
> >
> >         if (stock->cached_objcg != objcg) { /* reset if necessary */
> > -               drain_obj_stock(stock);
> > +               if (stock->cached_objcg)
> > +                       drain_obj_stock(stock);
> >                 obj_cgroup_get(objcg);
> >                 stock->cached_objcg = objcg;
> >                 stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> >         }
> >         stock->nr_bytes += nr_bytes;
> >
> > -       if (stock->nr_bytes > PAGE_SIZE)
> > -               drain_obj_stock(stock);
> > +       if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> > +               nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> > +               stock->nr_bytes &= (PAGE_SIZE - 1);
> > +       }
> >
> >         put_obj_stock(flags);
> > +
> > +       if (nr_pages) {
> > +               rcu_read_lock();
> > +               __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> > +               rcu_read_unlock();
> > +       }
>
> It is not safe to call __memcg_kmem_uncharge() under the rcu lock
> without holding a reference to the memcg. More details can be found
> at the following link.
>
> https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuchun@bytedance.com/
>
> In the above patchset, we introduce obj_cgroup_uncharge_pages() to
> uncharge pages from an object cgroup. You can use this safe API.
>

I would recommend just rebasing the patch series over the latest mm tree.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 15:14     ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2021-04-19 15:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 13 +++++++++++++
>  mm/slab.h       | 16 ++--------------
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..dc9032f28f2e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	css_put(&memcg->css);
>  }
>  
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +		     enum node_stat_item idx, int nr)
> +{
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec = NULL;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	mod_memcg_lruvec_state(lruvec, idx, nr);
> +	rcu_read_unlock();
> +}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
@ 2021-04-19 15:19         ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 15:19 UTC (permalink / raw)
  To: Shakeel Butt, Muchun Song
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux Memory Management List, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 11:00 AM, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 11:07 PM Muchun Song <songmuchun@bytedance.com> wrote:
>> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <longman@redhat.com> wrote:
>>> There are two issues with the current refill_obj_stock() code. First of
>>> all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
>>> atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg
>>> and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
>>> be used again, which leads to another call to drain_obj_stock() and
>>> obj_cgroup_get() as well as atomically retrieving the available bytes from
>>> obj_cgroup. That is costly. Instead, we should just uncharge the excess
>>> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
>>> function should only be called when the obj_cgroup changes.
>>>
>>> Secondly, when charging an object at least a page in size in
>>> obj_cgroup_charge(), it is possible that the remaining bytes to be
>>> refilled to the stock will overflow a page and cause refill_obj_stock()
>>> to uncharge one page. To avoid the additional uncharge in this case,
>>> a new overfill flag is added to refill_obj_stock() which is set
>>> when it is called from obj_cgroup_charge().
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   mm/memcontrol.c | 23 +++++++++++++++++------
>>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index a6dd18f6d8a8..d13961352eef 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>>          return false;
>>>   }
>>>
>>> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>>> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>>> +                            bool overfill)
>>>   {
>>>          unsigned long flags;
>>>          struct obj_stock *stock = get_obj_stock(&flags);
>>> +       unsigned int nr_pages = 0;
>>>
>>>          if (stock->cached_objcg != objcg) { /* reset if necessary */
>>> -               drain_obj_stock(stock);
>>> +               if (stock->cached_objcg)
>>> +                       drain_obj_stock(stock);
>>>                  obj_cgroup_get(objcg);
>>>                  stock->cached_objcg = objcg;
>>>                  stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>>>          }
>>>          stock->nr_bytes += nr_bytes;
>>>
>>> -       if (stock->nr_bytes > PAGE_SIZE)
>>> -               drain_obj_stock(stock);
>>> +       if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
>>> +               nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>>> +               stock->nr_bytes &= (PAGE_SIZE - 1);
>>> +       }
>>>
>>>          put_obj_stock(flags);
>>> +
>>> +       if (nr_pages) {
>>> +               rcu_read_lock();
>>> +               __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
>>> +               rcu_read_unlock();
>>> +       }
>> It is not safe to call __memcg_kmem_uncharge() under the rcu lock
>> without holding a reference to the memcg. More details can be found
>> at the following link.
>>
>> https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuchun@bytedance.com/
>>
>> In the above patchset, we introduce obj_cgroup_uncharge_pages() to
>> uncharge pages from an object cgroup. You can use this safe API.
>>
> I would recommend just rebasing the patch series over the latest mm tree.
>
I see, I will rebase it to the latest mm tree.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 15:21       ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 11:14 AM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
>> so that further optimization can be done to it in later patches without
>> exposing unnecessary details to other mm components.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/memcontrol.c | 13 +++++++++++++
>>   mm/slab.h       | 16 ++--------------
>>   2 files changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e064ac0d850a..dc9032f28f2e 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>>   	css_put(&memcg->css);
>>   }
>>   
>> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> +		     enum node_stat_item idx, int nr)
>> +{
>> +	struct mem_cgroup *memcg;
>> +	struct lruvec *lruvec = NULL;
>> +
>> +	rcu_read_lock();
>> +	memcg = obj_cgroup_memcg(objcg);
>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +	mod_memcg_lruvec_state(lruvec, idx, nr);
>> +	rcu_read_unlock();
>> +}
> It would be more naturally placed next to the others, e.g. below
> __mod_lruvec_kmem_state().
>
> But no deal breaker if there isn't another revision.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
Yes, there will be another revision rebasing the patch series on
linux-next. I will move the function then.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
  2021-04-19  0:00   ` Waiman Long
  (?)
@ 2021-04-19 15:24     ` Shakeel Butt
  -1 siblings, 0 replies; 49+ messages in thread
From: Shakeel Butt @ 2021-04-19 15:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Sun, Apr 18, 2021 at 5:00 PM Waiman Long <longman@redhat.com> wrote:
>
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
@ 2021-04-19 15:56       ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 15:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux Memory Management List, Shakeel Butt, Alex Shi, Chris Down,
	Yafang Shao, Wei Yang, Masayoshi Mizuma, Xing Zhengjun,
	Matthew Wilcox

On 4/19/21 2:06 AM, Muchun Song wrote:
> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <longman@redhat.com> wrote:
>> There are two issues with the current refill_obj_stock() code. First of
>> all, when nr_bytes exceeds PAGE_SIZE, it calls drain_obj_stock() to
>> atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg
>> and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
>> be used again, which leads to another call to drain_obj_stock() and
>> obj_cgroup_get() as well as atomically retrieving the available bytes from
>> obj_cgroup. That is costly. Instead, we should just uncharge the excess
>> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
>> function should only be called when the obj_cgroup changes.
>>
>> Secondly, when charging an object at least a page in size in
>> obj_cgroup_charge(), it is possible that the remaining bytes to be
>> refilled to the stock will overflow a page and cause refill_obj_stock()
>> to uncharge one page. To avoid the additional uncharge in this case,
>> a new overfill flag is added to refill_obj_stock() which is set
>> when it is called from obj_cgroup_charge().
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/memcontrol.c | 23 +++++++++++++++++------
>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a6dd18f6d8a8..d13961352eef 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>          return false;
>>   }
>>
>> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>> +                            bool overfill)
>>   {
>>          unsigned long flags;
>>          struct obj_stock *stock = get_obj_stock(&flags);
>> +       unsigned int nr_pages = 0;
>>
>>          if (stock->cached_objcg != objcg) { /* reset if necessary */
>> -               drain_obj_stock(stock);
>> +               if (stock->cached_objcg)
>> +                       drain_obj_stock(stock);
>>                  obj_cgroup_get(objcg);
>>                  stock->cached_objcg = objcg;
>>                  stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>>          }
>>          stock->nr_bytes += nr_bytes;
>>
>> -       if (stock->nr_bytes > PAGE_SIZE)
>> -               drain_obj_stock(stock);
>> +       if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
>> +               nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>> +               stock->nr_bytes &= (PAGE_SIZE - 1);
>> +       }
>>
>>          put_obj_stock(flags);
>> +
>> +       if (nr_pages) {
>> +               rcu_read_lock();
>> +               __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
>> +               rcu_read_unlock();
>> +       }
> It is not safe to call __memcg_kmem_uncharge() under the rcu lock
> without holding a reference to the memcg. More details can be found
> at the following link.
>
> https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuchun@bytedance.com/
>
> In the above patchset, we introduce obj_cgroup_uncharge_pages() to
> uncharge pages from an object cgroup. You can use this safe API.

Thanks for the comment. Will update my patch with a call to
obj_cgroup_uncharge_pages().

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 16:18         ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 16:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 11:21 AM, Waiman Long wrote:
> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>> The mod_objcg_state() function is moved from mm/slab.h to 
>>> mm/memcontrol.c
>>> so that further optimization can be done to it in later patches without
>>> exposing unnecessary details to other mm components.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   mm/memcontrol.c | 13 +++++++++++++
>>>   mm/slab.h       | 16 ++--------------
>>>   2 files changed, 15 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index e064ac0d850a..dc9032f28f2e 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page 
>>> *page, int order)
>>>       css_put(&memcg->css);
>>>   }
>>>   +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data 
>>> *pgdat,
>>> +             enum node_stat_item idx, int nr)
>>> +{
>>> +    struct mem_cgroup *memcg;
>>> +    struct lruvec *lruvec = NULL;
>>> +
>>> +    rcu_read_lock();
>>> +    memcg = obj_cgroup_memcg(objcg);
>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>> +    mod_memcg_lruvec_state(lruvec, idx, nr);
>>> +    rcu_read_unlock();
>>> +}
>> It would be more naturally placed next to the others, e.g. below
>> __mod_lruvec_kmem_state().
>>
>> But no deal breaker if there isn't another revision.
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>
> Yes, there will be another revision rebasing the patch series on
> linux-next. I will move the function then.

OK, it turns out that mod_objcg_state() is only defined when
CONFIG_MEMCG_KMEM is enabled. That was why I put it in the
CONFIG_MEMCG_KMEM block with the other obj_stock functions. I think I
will keep it there.
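
Roughly, the layout I have in mind is (just a sketch of the placement,
with the function body as in patch 1):

#ifdef CONFIG_MEMCG_KMEM
/* get_obj_stock(), put_obj_stock() and the other obj_stock helpers ... */

void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
		     enum node_stat_item idx, int nr)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec = NULL;

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	mod_memcg_lruvec_state(lruvec, idx, nr);
	rcu_read_unlock();
}
#endif /* CONFIG_MEMCG_KMEM */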

Thanks,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-19 16:38     ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2021-04-19 16:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per-object byte charging,
> charging and vmstat data updates happened only when new slab pages were
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache the byte charge for a specific
> obj_cgroup to reduce that overhead. To reduce it further, this patch
> caches the vmstat data in the memcg_stock_pcp structure as well, until
> it accumulates a page size worth of updates or other cached data
> change. Caching the vmstat data in the per-cpu stock replaces two
> writes to non-hot cachelines, for the memcg-specific and the
> memcg-lruvec-specific vmstat data, with a single write to a hot local
> stock cacheline.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, about 20% (634400 out of 3243830) of the calls to
> mod_objcg_state() after initial boot led to an actual call to
> __mod_objcg_state(). When doing a parallel kernel build, the figure was
> about 17% (24329265 out of 142512465). So caching the vmstat data
> reduces the number of calls to __mod_objcg_state() by more than 80%.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dc9032f28f2e..693453f95d99 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  	struct obj_cgroup *cached_objcg;
> +	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
> +	int vmstat_idx;
> +	int vmstat_bytes;
>  #endif
>  
>  	struct work_struct work;
> @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	css_put(&memcg->css);
>  }
>  
> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> -		     enum node_stat_item idx, int nr)
> +static inline void __mod_objcg_state(struct obj_cgroup *objcg,
> +				     struct pglist_data *pgdat,
> +				     enum node_stat_item idx, int nr)

This naming is dangerous, as the __mod_foo naming scheme we use
everywhere else suggests it's the same function as mod_foo() just with
preemption/irqs disabled.
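
For example, the existing pair follows that pattern, roughly (a sketch
of the convention, not the exact upstream code):

	void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_memcg_state(memcg, idx, val);
		local_irq_restore(flags);
	}

so a reader would expect __mod_objcg_state() to be mod_objcg_state()
minus the irq handling, which is not what this patch implements.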

> @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  	rcu_read_lock();
>  	memcg = obj_cgroup_memcg(objcg);
>  	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	mod_memcg_lruvec_state(lruvec, idx, nr);
> +	__mod_memcg_lruvec_state(lruvec, idx, nr);
>  	rcu_read_unlock();
>  }
>  
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +		     enum node_stat_item idx, int nr)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	stock = this_cpu_ptr(&memcg_stock);
> +
> +	/*
> +	 * Save vmstat data in stock and skip vmstat array update unless
> +	 * accumulating over a page of vmstat data or when pgdat or idx
> +	 * changes.
> +	 */
> +	if (stock->cached_objcg != objcg) {
> +		/* Output the current data as is */

When you get here with the wrong objcg and hit the cold path, it's
usually immediately followed by an uncharge -> refill_obj_stock() that
will then flush and reset cached_objcg.

Instead of doing two cold paths, why not flush the old objcg right
away and set the new one so that refill_obj_stock() can use the fast path?

> +	} else if (!stock->vmstat_bytes) {
> +		/* Save the current data */
> +		stock->vmstat_bytes = nr;
> +		stock->vmstat_idx = idx;
> +		stock->cached_pgdat = pgdat;
> +		nr = 0;
> +	} else if ((stock->cached_pgdat != pgdat) ||
> +		   (stock->vmstat_idx != idx)) {
> +		/* Output the cached data & save the current data */
> +		swap(nr, stock->vmstat_bytes);
> +		swap(idx, stock->vmstat_idx);
> +		swap(pgdat, stock->cached_pgdat);

Is this optimization worth doing?

You later split vmstat_bytes, so idx doesn't change anymore.

How often does the pgdat change? This is a per-cpu cache after all,
and the numa node a given cpu allocates from tends to not change that
often. Even with interleaving mode, which I think is pretty rare, the
interleaving happens at the slab/page level, not the object level, and
the cache isn't bigger than a page anyway.

> +	} else {
> +		stock->vmstat_bytes += nr;
> +		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> +			nr = stock->vmstat_bytes;
> +			stock->vmstat_bytes = 0;
> +		} else {
> +			nr = 0;
> +		}

..and this is the regular overflow handling done by the objcg and
memcg charge stock as well.

How about this?

	/* Flush the old state and start caching for the new objcg/pgdat/idx */
	if (stock->cached_objcg != objcg ||
	    stock->cached_pgdat != pgdat ||
	    stock->vmstat_idx != idx) {
		drain_obj_stock(stock);
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
		stock->cached_pgdat = pgdat;
		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
		stock->vmstat_idx = idx;
	}
	stock->vmstat_bytes += nr;

	if (abs(stock->vmstat_bytes) > PAGE_SIZE)
		drain_obj_stock(stock);

(Maybe we could be clever here, since the charge and stat caches are
the same size: don't flush an oversized charge cache from
refill_obj_stock in the charge path, but leave it to the
mod_objcg_state() that follows; likewise don't flush an undersized
vmstat stock from mod_objcg_state() in the uncharge path, but leave it
to the refill_obj_stock() that follows. Could get a bit complicated...)

> @@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  		stock->nr_bytes = 0;
>  	}
>  
> +	/*
> +	 * Flush the vmstat data in current stock
> +	 */
> +	if (stock->vmstat_bytes) {
> +		__mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> +				  stock->vmstat_bytes);

... then inline __mod_objcg_state() here into the only caller, and
there won't be any need to come up with a better name.

> +		stock->cached_pgdat = NULL;
> +		stock->vmstat_bytes = 0;
> +		stock->vmstat_idx = 0;
> +	}
> +
>  	obj_cgroup_put(old);
>  	stock->cached_objcg = NULL;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
@ 2021-04-19 16:55     ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2021-04-19 16:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:
> Currently, the object stock structure caches either reclaimable vmstat
> bytes or unreclaimable vmstat bytes, but not both. The hit rate can be
> improved if both types of vmstat data can be cached, especially on a
> single-node system.
> 
> This patch supports the caching of both types of vmstat data, though at
> the expense of slightly increased complexity in the caching code. For a
> large object (>= PAGE_SIZE), the vmstat array is updated directly
> without going through the stock caching step.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled, the
> miss rates are shown in the table below.
> 
>   Initial bootup:
> 
>   Kernel       __mod_objcg_state    mod_objcg_state    %age
>   ------       -----------------    ---------------    ----
>   Before patch      634400              3243830        19.6%
>   After patch       419810              3182424        13.2%
> 
>   Parallel kernel build:
> 
>   Kernel       __mod_objcg_state    mod_objcg_state    %age
>   ------       -----------------    ---------------    ----
>   Before patch      24329265           142512465       17.1%
>   After patch       24051721           142445825       16.9%
> 
> There was a decrease in the miss rate after initial system bootup.
> However, the miss rate for the parallel kernel build remained about the
> same, probably because most of the touched kmemcache objects were
> reclaimable inodes and dentries.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 51 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c13502eab282..a6dd18f6d8a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2212,8 +2212,8 @@ struct obj_stock {
>  	struct obj_cgroup *cached_objcg;
>  	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
> -	int vmstat_idx;
> -	int vmstat_bytes;
> +	int reclaimable_bytes;		/* NR_SLAB_RECLAIMABLE_B */
> +	int unreclaimable_bytes;	/* NR_SLAB_UNRECLAIMABLE_B */

How about

	int nr_slab_reclaimable_b;
	int nr_slab_unreclaimable_b;

so you don't need the comments?

>  #else
>  	int dummy[0];
>  #endif
> @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  		     enum node_stat_item idx, int nr)
>  {
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
> +	struct obj_stock *stock;
> +	int *bytes, *alt_bytes, alt_idx;
> +
> +	/*
> +	 * Directly update vmstat array for big object.
> +	 */
> +	if (unlikely(abs(nr) >= PAGE_SIZE))
> +		goto update_vmstat;

This looks like an optimization independent of the vmstat item split?

> +	stock = get_obj_stock(&flags);
> +	if (idx == NR_SLAB_RECLAIMABLE_B) {
> +		bytes = &stock->reclaimable_bytes;
> +		alt_bytes = &stock->unreclaimable_bytes;
> +		alt_idx = NR_SLAB_UNRECLAIMABLE_B;
> +	} else {
> +		bytes = &stock->unreclaimable_bytes;
> +		alt_bytes = &stock->reclaimable_bytes;
> +		alt_idx = NR_SLAB_RECLAIMABLE_B;
> +	}
>  
>  	/*
> -	 * Save vmstat data in stock and skip vmstat array update unless
> -	 * accumulating over a page of vmstat data or when pgdat or idx
> +	 * Try to save vmstat data in stock and skip vmstat array update
> +	 * unless accumulating over a page of vmstat data or when pgdat
>  	 * changes.
>  	 */
>  	if (stock->cached_objcg != objcg) {
>  		/* Output the current data as is */
> -	} else if (!stock->vmstat_bytes) {
> -		/* Save the current data */
> -		stock->vmstat_bytes = nr;
> -		stock->vmstat_idx = idx;
> -		stock->cached_pgdat = pgdat;
> -		nr = 0;
> -	} else if ((stock->cached_pgdat != pgdat) ||
> -		   (stock->vmstat_idx != idx)) {
> -		/* Output the cached data & save the current data */
> -		swap(nr, stock->vmstat_bytes);
> -		swap(idx, stock->vmstat_idx);
> +	} else if (stock->cached_pgdat != pgdat) {
> +		/* Save the current data and output cached data, if any */
> +		swap(nr, *bytes);
>  		swap(pgdat, stock->cached_pgdat);
> +		if (*alt_bytes) {
> +			__mod_objcg_state(objcg, pgdat, alt_idx,
> +					  *alt_bytes);
> +			*alt_bytes = 0;
> +		}

As per the other email, I really don't think optimizing the pgdat
switch (in a percpu cache) is worth this level of complexity.

>  	} else {
> -		stock->vmstat_bytes += nr;
> -		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> -			nr = stock->vmstat_bytes;
> -			stock->vmstat_bytes = 0;
> +		*bytes += nr;
> +		if (abs(*bytes) > PAGE_SIZE) {
> +			nr = *bytes;
> +			*bytes = 0;
>  		} else {
>  			nr = 0;
>  		}
>  	}
> -	if (nr)
> -		__mod_objcg_state(objcg, pgdat, idx, nr);
> -
>  	put_obj_stock(flags);
> +	if (!nr)
> +		return;
> +update_vmstat:
> +	__mod_objcg_state(objcg, pgdat, idx, nr);
>  }
>  
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> @@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
>  	/*
>  	 * Flush the vmstat data in current stock
>  	 */
> -	if (stock->vmstat_bytes) {
> -		__mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> -				  stock->vmstat_bytes);
> +	if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
> +		int bytes;
> +
> +		if ((bytes = stock->reclaimable_bytes))
> +			__mod_objcg_state(old, stock->cached_pgdat,
> +					  NR_SLAB_RECLAIMABLE_B, bytes);
> +		if ((bytes = stock->unreclaimable_bytes))
> +			__mod_objcg_state(old, stock->cached_pgdat,
> +					  NR_SLAB_UNRECLAIMABLE_B, bytes);

The int bytes indirection isn't necessary. It's easier to read even
with the extra lines required to repeat the long stock member names,
because that is quite a common pattern (if (stuff) frob(stuff)).

__mod_objcg_state() also does rcu_read_lock() toggling and a memcg
lookup on every call, both of which could be batched. I think that is
further proof that it should just be inlined here.
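
Something along these lines (a rough, untested sketch built only from
the helpers __mod_objcg_state() already uses) would flush both counters
inside a single RCU section with one memcg/lruvec lookup:

	if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
		struct mem_cgroup *memcg;
		struct lruvec *lruvec;

		rcu_read_lock();
		memcg = obj_cgroup_memcg(old);
		lruvec = mem_cgroup_lruvec(memcg, stock->cached_pgdat);
		if (stock->reclaimable_bytes)
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B,
						 stock->reclaimable_bytes);
		if (stock->unreclaimable_bytes)
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_UNRECLAIMABLE_B,
						 stock->unreclaimable_bytes);
		rcu_read_unlock();

		stock->reclaimable_bytes = 0;
		stock->unreclaimable_bytes = 0;
	}

That way the lookup cost is paid once per drain instead of once per
cached counter.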

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 17:13           ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2021-04-19 17:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
> On 4/19/21 11:21 AM, Waiman Long wrote:
> > On 4/19/21 11:14 AM, Johannes Weiner wrote:
> > > On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> > > > The mod_objcg_state() function is moved from mm/slab.h to
> > > > mm/memcontrol.c
> > > > so that further optimization can be done to it in later patches without
> > > > exposing unnecessary details to other mm components.
> > > > 
> > > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > > ---
> > > >   mm/memcontrol.c | 13 +++++++++++++
> > > >   mm/slab.h       | 16 ++--------------
> > > >   2 files changed, 15 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e064ac0d850a..dc9032f28f2e 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
> > > > page *page, int order)
> > > >       css_put(&memcg->css);
> > > >   }
> > > >   +void mod_objcg_state(struct obj_cgroup *objcg, struct
> > > > pglist_data *pgdat,
> > > > +             enum node_stat_item idx, int nr)
> > > > +{
> > > > +    struct mem_cgroup *memcg;
> > > > +    struct lruvec *lruvec = NULL;
> > > > +
> > > > +    rcu_read_lock();
> > > > +    memcg = obj_cgroup_memcg(objcg);
> > > > +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > +    mod_memcg_lruvec_state(lruvec, idx, nr);
> > > > +    rcu_read_unlock();
> > > > +}
> > > It would be more naturally placed next to the others, e.g. below
> > > __mod_lruvec_kmem_state().
> > > 
> > > But no deal breaker if there isn't another revision.
> > > 
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > 
> > Yes, there will be another revision by rebasing patch series on the
> > linux-next. I will move the function then.
> 
> OK, it turns out that mod_objcg_state() is only defined if
> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
> the other obj_stock functions. I think I will keep it there.

The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
even that doesn't make sense because while slob doesn't support slab
object tracking, we can still do all the other stuff we do under
KMEM. I have a patch in the works to remove the symbol and ifdefs.

With that in mind, it's better to group the functions based on what
they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
remove another ifdef later than it is to reorder the functions.
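
For example (just to illustrate the grouping, not an actual change),
mod_objcg_state() could live next to __mod_lruvec_kmem_state() and keep
a local ifdef for now, which is then trivial to drop later:

#ifdef CONFIG_MEMCG_KMEM
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
		     enum node_stat_item idx, int nr)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec = NULL;

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	mod_memcg_lruvec_state(lruvec, idx, nr);
	rcu_read_unlock();
}
#endif /* CONFIG_MEMCG_KMEM */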

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 17:19             ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 17:19 UTC (permalink / raw)
  To: Johannes Weiner, Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 1:13 PM, Johannes Weiner wrote:
> On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
>> On 4/19/21 11:21 AM, Waiman Long wrote:
>>> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>>>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>>>> The mod_objcg_state() function is moved from mm/slab.h to
>>>>> mm/memcontrol.c
>>>>> so that further optimization can be done to it in later patches without
>>>>> exposing unnecessary details to other mm components.
>>>>>
>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>> ---
>>>>>    mm/memcontrol.c | 13 +++++++++++++
>>>>>    mm/slab.h       | 16 ++--------------
>>>>>    2 files changed, 15 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index e064ac0d850a..dc9032f28f2e 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
>>>>> page *page, int order)
>>>>>        css_put(&memcg->css);
>>>>>    }
>>>>>    +void mod_objcg_state(struct obj_cgroup *objcg, struct
>>>>> pglist_data *pgdat,
>>>>> +             enum node_stat_item idx, int nr)
>>>>> +{
>>>>> +    struct mem_cgroup *memcg;
>>>>> +    struct lruvec *lruvec = NULL;
>>>>> +
>>>>> +    rcu_read_lock();
>>>>> +    memcg = obj_cgroup_memcg(objcg);
>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>> +    mod_memcg_lruvec_state(lruvec, idx, nr);
>>>>> +    rcu_read_unlock();
>>>>> +}
>>>> It would be more naturally placed next to the others, e.g. below
>>>> __mod_lruvec_kmem_state().
>>>>
>>>> But no deal breaker if there isn't another revision.
>>>>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>
>>> Yes, there will be another revision by rebasing patch series on the
>>> linux-next. I will move the function then.
>> OK, it turns out that mod_objcg_state() is only defined if
>> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
>> the other obj_stock functions. I think I will keep it there.
> The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
> configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
> even that doesn't make sense because while slob doesn't support slab
> object tracking, we can still do all the other stuff we do under
> KMEM. I have a patch in the works to remove the symbol and ifdefs.
>
> With that in mind, it's better to group the functions based on what
> they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
> remove another ifdef later than it is to reorder the functions.
>
OK, I will make the move then. Thanks for the explanation.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 17:26               ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 17:26 UTC (permalink / raw)
  To: Waiman Long, Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 1:19 PM, Waiman Long wrote:
> On 4/19/21 1:13 PM, Johannes Weiner wrote:
>> On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
>>> On 4/19/21 11:21 AM, Waiman Long wrote:
>>>> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>>>>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>>>>> The mod_objcg_state() function is moved from mm/slab.h to
>>>>>> mm/memcontrol.c
>>>>>> so that further optimization can be done to it in later patches 
>>>>>> without
>>>>>> exposing unnecessary details to other mm components.
>>>>>>
>>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>>> ---
>>>>>>    mm/memcontrol.c | 13 +++++++++++++
>>>>>>    mm/slab.h       | 16 ++--------------
>>>>>>    2 files changed, 15 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>> index e064ac0d850a..dc9032f28f2e 100644
>>>>>> --- a/mm/memcontrol.c
>>>>>> +++ b/mm/memcontrol.c
>>>>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
>>>>>> page *page, int order)
>>>>>>        css_put(&memcg->css);
>>>>>>    }
>>>>>>    +void mod_objcg_state(struct obj_cgroup *objcg, struct
>>>>>> pglist_data *pgdat,
>>>>>> +             enum node_stat_item idx, int nr)
>>>>>> +{
>>>>>> +    struct mem_cgroup *memcg;
>>>>>> +    struct lruvec *lruvec = NULL;
>>>>>> +
>>>>>> +    rcu_read_lock();
>>>>>> +    memcg = obj_cgroup_memcg(objcg);
>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>> +    mod_memcg_lruvec_state(lruvec, idx, nr);
>>>>>> +    rcu_read_unlock();
>>>>>> +}
>>>>> It would be more naturally placed next to the others, e.g. below
>>>>> __mod_lruvec_kmem_state().
>>>>>
>>>>> But no deal breaker if there isn't another revision.
>>>>>
>>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>>
>>>> Yes, there will be another revision by rebasing patch series on the
>>>> linux-next. I will move the function then.
>>> OK, it turns out that mod_objcg_state() is only defined if
>>> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM 
>>> block with
>>> the other obj_stock functions. I think I will keep it there.
>> The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
>> configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
>> even that doesn't make sense because while slob doesn't support slab
>> object tracking, we can still do all the other stuff we do under
>> KMEM. I have a patch in the works to remove the symbol and ifdefs.
>>
>> With that in mind, it's better to group the functions based on what
>> they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
>> remove another ifdef later than it is to reorder the functions.
>>
> OK, I will make the move then. Thanks for the explanation. 

BTW, have you ever thought of moving the cgroup-v1 specific functions 
out into a separate memcontrol-v1.c file just like 
kernel/cgroup/cgroup-v1.c?

I thought of that before, but memcontrol.c is a frequently changed file 
and so a bit hard to do.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 21:11                 ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2021-04-19 21:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
> On 4/19/21 1:19 PM, Waiman Long wrote:
> > On 4/19/21 1:13 PM, Johannes Weiner wrote:
> > > The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
> > > configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
> > > even that doesn't make sense because while slob doesn't support slab
> > > object tracking, we can still do all the other stuff we do under
> > > KMEM. I have a patch in the works to remove the symbol and ifdefs.
> > > 
> > > With that in mind, it's better to group the functions based on what
> > > they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
> > > remove another ifdef later than it is to reorder the functions.
> > > 
> > OK, I will make the move then. Thanks for the explanation.

Thanks!

> BTW, have you ever thought of moving the cgroup-v1 specific functions out
> into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?
> 
> I thought of that before, but memcontrol.c is a frequently changed file and
> so a bit hard to do.

I haven't looked too deeply at it so far, but I think it would make
sense to try.

There are indeed many of the entry paths from the MM code that are
shared between cgroup1 and cgroup2, with smaller branches here and
there to adjust behavior. Those would throw conflicts, but those we
should probably keep in the main memcontrol.c for readability anyway.

But there is also plenty of code that is exclusively about cgroup1,
and which actually doesn't change much in a long time. Moving that
elsewhere shouldn't create difficult conflicts - maybe a few line
offset warnings or fuzz in the diff context of unrelated changes:

- the soft limit tree and soft limit reclaim

- the threshold and oom event notification stuff

- the charge moving code

- remaining v1 interface files, as well as their helper functions

From a quick scan, this adds up to ~2,500 lines of old code with no
actual dependencies from the common code or from v2, and which could
be moved out of the way without disrupting ongoing development much.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-19 21:24                   ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 21:24 UTC (permalink / raw)
  To: Johannes Weiner, Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 5:11 PM, Johannes Weiner wrote:
>
>> BTW, have you ever thought of moving the cgroup-v1 specific functions out
>> into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?
>>
>> I thought of that before, but memcontrol.c is a frequently changed file and
>> so a bit hard to do.
> I haven't looked too deeply at it so far, but I think it would make
> sense to try.
>
> There are indeed many of the entry paths from the MM code that are
> shared between cgroup1 and cgroup2, with smaller branches here and
> there to adjust behavior. Those would throw conflicts, but those we
> should probably keep in the main memcontrol.c for readability anyway.
>
> But there is also plenty of code that is exclusively about cgroup1,
> and which actually doesn't change much in a long time. Moving that
> elsewhere shouldn't create difficult conflicts - maybe a few line
> offset warnings or fuzz in the diff context of unrelated changes:
>
> - the soft limit tree and soft limit reclaim
>
> - the threshold and oom event notification stuff
>
> - the charge moving code
>
> - remaining v1 interface files, as well as their helper functions
>
>  From a quick scan, this adds up to ~2,500 lines of old code with no
> actual dependencies from the common code or from v2, and which could
> be moved out of the way without disrupting ongoing development much.
>
Right.

Currently memcontrol.c has over 7000 lines of code and keeps growing. 
That makes it harder to read, navigate and update. If we can cut out 
2000 lines or more from memcontrol.c, it will make it more manageable.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-19 23:42       ` Waiman Long
  0 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-19 23:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 12:38 PM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per object byte charging,
>> charging and vmstat data update happen only when new slab pages are
>> allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache the byte charge for a specific
>> obj_cgroup to reduce that overhead. To further reduce it, this patch
>> caches the vmstat data in the memcg_stock_pcp structure as well, until
>> it accumulates a page size worth of updates or until other cached data
>> change. Caching the vmstat data in the per-cpu stock replaces two
>> writes to non-hot cachelines (for the memcg-specific and the
>> memcg-lruvec-specific vmstat data) with a single write to a hot local
>> stock cacheline.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, it was found that about 20% (634400 out of 3243830)
>> of the time when mod_objcg_state() is called leads to an actual call
>> to __mod_objcg_state() after initial boot. When doing parallel kernel
>> build, the figure was about 17% (24329265 out of 142512465). So caching
>> the vmstat data reduces the number of calls to __mod_objcg_state()
>> by more than 80%.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Reviewed-by: Shakeel Butt <shakeelb@google.com>
>> ---
>>   mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 61 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index dc9032f28f2e..693453f95d99 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
>>   
>>   #ifdef CONFIG_MEMCG_KMEM
>>   	struct obj_cgroup *cached_objcg;
>> +	struct pglist_data *cached_pgdat;
>>   	unsigned int nr_bytes;
>> +	int vmstat_idx;
>> +	int vmstat_bytes;
>>   #endif
>>   
>>   	struct work_struct work;
>> @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>>   	css_put(&memcg->css);
>>   }
>>   
>> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> -		     enum node_stat_item idx, int nr)
>> +static inline void __mod_objcg_state(struct obj_cgroup *objcg,
>> +				     struct pglist_data *pgdat,
>> +				     enum node_stat_item idx, int nr)
> This naming is dangerous, as the __mod_foo naming scheme we use
> everywhere else suggests it's the same function as mod_foo() just with
> preemption/irqs disabled.
>
I will change its name to, say, mod_objcg_mlstate() to indicate that it 
is something different. Actually, it is hard to come up with a good name 
which is not too long.


>> @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>>   	rcu_read_lock();
>>   	memcg = obj_cgroup_memcg(objcg);
>>   	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> -	mod_memcg_lruvec_state(lruvec, idx, nr);
>> +	__mod_memcg_lruvec_state(lruvec, idx, nr);
>>   	rcu_read_unlock();
>>   }
>>   
>> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> +		     enum node_stat_item idx, int nr)
>> +{
>> +	struct memcg_stock_pcp *stock;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	stock = this_cpu_ptr(&memcg_stock);
>> +
>> +	/*
>> +	 * Save vmstat data in stock and skip vmstat array update unless
>> +	 * accumulating over a page of vmstat data or when pgdat or idx
>> +	 * changes.
>> +	 */
>> +	if (stock->cached_objcg != objcg) {
>> +		/* Output the current data as is */
> When you get here with the wrong objcg and hit the cold path, it's
> usually immediately followed by an uncharge -> refill_obj_stock() that
> will then flush and reset cached_objcg.
>
> Instead of doing two cold paths, why not flush the old objcg right
> away and set the new so that refill_obj_stock() can use the fast path?

That is a good idea. Will do that.


>
>> +	} else if (!stock->vmstat_bytes) {
>> +		/* Save the current data */
>> +		stock->vmstat_bytes = nr;
>> +		stock->vmstat_idx = idx;
>> +		stock->cached_pgdat = pgdat;
>> +		nr = 0;
>> +	} else if ((stock->cached_pgdat != pgdat) ||
>> +		   (stock->vmstat_idx != idx)) {
>> +		/* Output the cached data & save the current data */
>> +		swap(nr, stock->vmstat_bytes);
>> +		swap(idx, stock->vmstat_idx);
>> +		swap(pgdat, stock->cached_pgdat);
> Is this optimization worth doing?
>
> You later split vmstat_bytes and idx doesn't change anymore.

I am going to merge patch 2 and patch 4 to avoid the confusion.


>
> How often does the pgdat change? This is a per-cpu cache after all,
> and the numa node a given cpu allocates from tends to not change that
> often. Even with interleaving mode, which I think is pretty rare, the
> interleaving happens at the slab/page level, not the object level, and
> the cache isn't bigger than a page anyway.

The testing done on a 2-socket system indicated that pgdat changes
roughly 10-20% of the time. So it does happen, especially on the kfree()
path, I think. I have tried to cache vmstat updates only for objects on
the local node, but I got more misses with that. So I am just going to
flush out the existing data and switch pgdat for now.
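
Roughly along these lines (an untested sketch, still using the patch 2
field names rather than the final code):

	} else if (stock->cached_pgdat != pgdat) {
		/* Flush the data cached for the old node ... */
		if (stock->vmstat_bytes)
			__mod_objcg_state(objcg, stock->cached_pgdat,
					  stock->vmstat_idx,
					  stock->vmstat_bytes);
		/* ... and start caching for the new node */
		stock->cached_pgdat = pgdat;
		stock->vmstat_bytes = nr;
		stock->vmstat_idx = idx;
		nr = 0;
	}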


>
>> +	} else {
>> +		stock->vmstat_bytes += nr;
>> +		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
>> +			nr = stock->vmstat_bytes;
>> +			stock->vmstat_bytes = 0;
>> +		} else {
>> +			nr = 0;
>> +		}
> ..and this is the regular overflow handling done by the objcg and
> memcg charge stock as well.
>
> How about this?
>
> 	if (stock->cached_objcg != objcg ||
> 	    stock->cached_pgdat != pgdat ||
> 	    stock->vmstat_idx != idx) {
> 		drain_obj_stock(stock);
> 		obj_cgroup_get(objcg);
> 		stock->cached_objcg = objcg;
> 		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> 		stock->vmstat_idx = idx;
> 	}
> 	stock->vmstat_bytes += nr;
>
> 	if (abs(stock->vmstat_bytes) > PAGE_SIZE)
> 		drain_obj_stock(stock);
>
> (Maybe we could be clever, here since the charge and stat caches are
> the same size: don't flush an oversized charge cache from
> refill_obj_stock in the charge path, but leave it to the
> mod_objcg_state() that follows; likewise don't flush an undersized
> vmstat stock from mod_objcg_state() in the uncharge path, but leave it
> to the refill_obj_stock() that follows. Could get a bit complicated...)

If you look at patch 5, I am trying to avoid doing drain_obj_stock()
unless the objcg changes. I am going to do the same here.
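
Roughly what I have in mind for the next version (just a sketch of the
direction, using the renamed helper mentioned above; details may still
change):

	if (stock->cached_objcg != objcg) {
		/* Full drain only when the objcg changes */
		drain_obj_stock(stock);
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
		stock->cached_pgdat = pgdat;
		stock->vmstat_idx = idx;
		stock->vmstat_bytes = nr;
	} else if ((stock->cached_pgdat != pgdat) ||
		   (stock->vmstat_idx != idx)) {
		/* Flush only the cached vmstat bytes, no full drain */
		if (stock->vmstat_bytes)
			mod_objcg_mlstate(objcg, stock->cached_pgdat,
					  stock->vmstat_idx,
					  stock->vmstat_bytes);
		stock->cached_pgdat = pgdat;
		stock->vmstat_idx = idx;
		stock->vmstat_bytes = nr;
	} else {
		stock->vmstat_bytes += nr;
		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
			mod_objcg_mlstate(objcg, pgdat, idx,
					  stock->vmstat_bytes);
			stock->vmstat_bytes = 0;
		}
	}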

Cheers,
Longman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
@ 2021-04-20  8:05                   ` Michal Hocko
  0 siblings, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2021-04-20  8:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Waiman Long, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On Mon 19-04-21 17:11:37, Johannes Weiner wrote:
> On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
[...]
> - the soft limit tree and soft limit reclaim
> 
> - the threshold and oom event notification stuff
> 
> - the charge moving code
> 
> - remaining v1 interface files, as well as their helper functions
> 
> From a quick scan, this adds up to ~2,500 lines of old code with no
> actual dependencies from the common code or from v2, and which could
> be moved out of the way without disrupting ongoing development much.

Moving those into their own file makes sense to me as well. If the code is
not conditional (e.g. like swap accounting and some others) then moving
it would make memcontrol.c easier to navigate.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
  2021-04-19 16:55     ` Johannes Weiner
@ 2021-04-20 19:09       ` Waiman Long
  -1 siblings, 0 replies; 49+ messages in thread
From: Waiman Long @ 2021-04-20 19:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Matthew Wilcox

On 4/19/21 12:55 PM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:
>> Currently, the object stock structure caches either reclaimable or
>> unreclaimable vmstat bytes, but not both. The hit rate can be improved
>> if both types of vmstat data can be cached, especially on a
>> single-node system.
>>
>> This patch supports the caching of both types of vmstat data, though
>> at the expense of slightly increased complexity in the caching code.
>> For large objects (>= PAGE_SIZE), the vmstat array is updated directly
>> without going through the stock caching step.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled, the
>> miss rates are shown in the table below.
>>
>>    Initial bootup:
>>
>>    Kernel       __mod_objcg_state    mod_objcg_state    %age
>>    ------       -----------------    ---------------    ----
>>    Before patch      634400              3243830        19.6%
>>    After patch       419810              3182424        13.2%
>>
>>    Parallel kernel build:
>>
>>    Kernel       __mod_objcg_state    mod_objcg_state    %age
>>    ------       -----------------    ---------------    ----
>>    Before patch      24329265           142512465       17.1%
>>    After patch       24051721           142445825       16.9%
>>
>> There was a decrease in the miss rate after the initial system bootup.
>> However, the miss rate for the parallel kernel build remained about the
>> same, probably because most of the touched kmemcache objects were
>> reclaimable inodes and dentries.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
>>   1 file changed, 51 insertions(+), 28 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c13502eab282..a6dd18f6d8a8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2212,8 +2212,8 @@ struct obj_stock {
>>   	struct obj_cgroup *cached_objcg;
>>   	struct pglist_data *cached_pgdat;
>>   	unsigned int nr_bytes;
>> -	int vmstat_idx;
>> -	int vmstat_bytes;
>> +	int reclaimable_bytes;		/* NR_SLAB_RECLAIMABLE_B */
>> +	int unreclaimable_bytes;	/* NR_SLAB_UNRECLAIMABLE_B */
> How about
>
> 	int nr_slab_reclaimable_b;
> 	int nr_slab_unreclaimable_b;
>
> so you don't need the comments?

Sure, will make the change.
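
So the struct would end up looking something like this (a sketch,
modulo the final field placement):

	struct obj_stock {
	#ifdef CONFIG_MEMCG_KMEM
		struct obj_cgroup *cached_objcg;
		struct pglist_data *cached_pgdat;
		unsigned int nr_bytes;
		int nr_slab_reclaimable_b;
		int nr_slab_unreclaimable_b;
	#else
		int dummy[0];
	#endif
	};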


>>   #else
>>   	int dummy[0];
>>   #endif
>> @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>>   		     enum node_stat_item idx, int nr)
>>   {
>>   	unsigned long flags;
>> -	struct obj_stock *stock = get_obj_stock(&flags);
>> +	struct obj_stock *stock;
>> +	int *bytes, *alt_bytes, alt_idx;
>> +
>> +	/*
>> +	 * Directly update vmstat array for big object.
>> +	 */
>> +	if (unlikely(abs(nr) >= PAGE_SIZE))
>> +		goto update_vmstat;
> This looks like an optimization independent of the vmstat item split?
It may not be that helpful. I am going to take it out in the next version.
>
>> +	stock = get_obj_stock(&flags);
>> +	if (idx == NR_SLAB_RECLAIMABLE_B) {
>> +		bytes = &stock->reclaimable_bytes;
>> +		alt_bytes = &stock->unreclaimable_bytes;
>> +		alt_idx = NR_SLAB_UNRECLAIMABLE_B;
>> +	} else {
>> +		bytes = &stock->unreclaimable_bytes;
>> +		alt_bytes = &stock->reclaimable_bytes;
>> +		alt_idx = NR_SLAB_RECLAIMABLE_B;
>> +	}
>>   
>>   	/*
>> -	 * Save vmstat data in stock and skip vmstat array update unless
>> -	 * accumulating over a page of vmstat data or when pgdat or idx
>> +	 * Try to save vmstat data in stock and skip vmstat array update
>> +	 * unless accumulating over a page of vmstat data or when pgdat
>>   	 * changes.
>>   	 */
>>   	if (stock->cached_objcg != objcg) {
>>   		/* Output the current data as is */
>> -	} else if (!stock->vmstat_bytes) {
>> -		/* Save the current data */
>> -		stock->vmstat_bytes = nr;
>> -		stock->vmstat_idx = idx;
>> -		stock->cached_pgdat = pgdat;
>> -		nr = 0;
>> -	} else if ((stock->cached_pgdat != pgdat) ||
>> -		   (stock->vmstat_idx != idx)) {
>> -		/* Output the cached data & save the current data */
>> -		swap(nr, stock->vmstat_bytes);
>> -		swap(idx, stock->vmstat_idx);
>> +	} else if (stock->cached_pgdat != pgdat) {
>> +		/* Save the current data and output cached data, if any */
>> +		swap(nr, *bytes);
>>   		swap(pgdat, stock->cached_pgdat);
>> +		if (*alt_bytes) {
>> +			__mod_objcg_state(objcg, pgdat, alt_idx,
>> +					  *alt_bytes);
>> +			*alt_bytes = 0;
>> +		}
> As per the other email, I really don't think optimizing the pgdat
> switch (in a percpu cache) is worth this level of complexity.

I am going to simplify it in the next version.
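
Something along these lines, i.e. on a pgdat change just flush whatever
is cached for the old node and start over (rough sketch only, field
names per the rename discussed above):

	} else if (stock->cached_pgdat != pgdat) {
		/* Flush the cached data for the old pgdat */
		if (stock->nr_slab_reclaimable_b) {
			__mod_objcg_state(objcg, stock->cached_pgdat,
					  NR_SLAB_RECLAIMABLE_B,
					  stock->nr_slab_reclaimable_b);
			stock->nr_slab_reclaimable_b = 0;
		}
		if (stock->nr_slab_unreclaimable_b) {
			__mod_objcg_state(objcg, stock->cached_pgdat,
					  NR_SLAB_UNRECLAIMABLE_B,
					  stock->nr_slab_unreclaimable_b);
			stock->nr_slab_unreclaimable_b = 0;
		}
		stock->cached_pgdat = pgdat;
	}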


>
>>   	} else {
>> -		stock->vmstat_bytes += nr;
>> -		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
>> -			nr = stock->vmstat_bytes;
>> -			stock->vmstat_bytes = 0;
>> +		*bytes += nr;
>> +		if (abs(*bytes) > PAGE_SIZE) {
>> +			nr = *bytes;
>> +			*bytes = 0;
>>   		} else {
>>   			nr = 0;
>>   		}
>>   	}
>> -	if (nr)
>> -		__mod_objcg_state(objcg, pgdat, idx, nr);
>> -
>>   	put_obj_stock(flags);
>> +	if (!nr)
>> +		return;
>> +update_vmstat:
>> +	__mod_objcg_state(objcg, pgdat, idx, nr);
>>   }
>>   
>>   static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>> @@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
>>   	/*
>>   	 * Flush the vmstat data in current stock
>>   	 */
>> -	if (stock->vmstat_bytes) {
>> -		__mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
>> -				  stock->vmstat_bytes);
>> +	if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
>> +		int bytes;
>> +
>> +		if ((bytes = stock->reclaimable_bytes))
>> +			__mod_objcg_state(old, stock->cached_pgdat,
>> +					  NR_SLAB_RECLAIMABLE_B, bytes);
>> +		if ((bytes = stock->unreclaimable_bytes))
>> +			__mod_objcg_state(old, stock->cached_pgdat,
>> +					  NR_SLAB_UNRECLAIMABLE_B, bytes);
> The int bytes indirection isn't necessary. It's easier to read even
> with the extra lines required to repeat the long stock member names,
> because that is quite a common pattern (if (stuff) frob(stuff)).
OK, I will eliminate the bytes variable.
>
> __mod_objcg_state() also each time does rcu_read_lock() toggling and a
> memcg lookup that could be batched, which I think is further proof
> that it should just be inlined here.
>
I am also thinking that eliminating the unnecessary
rcu_read_lock/rcu_read_unlock pairs may help performance a bit. However,
that will be done in another patch after I have done more performance
testing. I am not going to bother with that in this series.
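
For example, the vmstat flush in drain_obj_stock() could eventually do
the memcg lookup once for both counters, something like the following
(sketch only, left for a follow-up patch; "old" is the objcg being
drained, as in the hunk above):

	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
		struct mem_cgroup *memcg;
		struct lruvec *lruvec;

		/* one RCU section and one memcg lookup for both counters */
		rcu_read_lock();
		memcg = obj_cgroup_memcg(old);
		lruvec = mem_cgroup_lruvec(memcg, stock->cached_pgdat);
		if (stock->nr_slab_reclaimable_b) {
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B,
						 stock->nr_slab_reclaimable_b);
			stock->nr_slab_reclaimable_b = 0;
		}
		if (stock->nr_slab_unreclaimable_b) {
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_UNRECLAIMABLE_B,
						 stock->nr_slab_unreclaimable_b);
			stock->nr_slab_unreclaimable_b = 0;
		}
		rcu_read_unlock();
	}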

Cheers,
Longman



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2021-04-20 19:09 UTC | newest]

Thread overview: 49+ messages
2021-04-19  0:00 [PATCH v4 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
2021-04-19  0:00 ` Waiman Long
2021-04-19  0:00 ` [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c Waiman Long
2021-04-19  0:00   ` Waiman Long
2021-04-19 15:14   ` Johannes Weiner
2021-04-19 15:14     ` Johannes Weiner
2021-04-19 15:21     ` Waiman Long
2021-04-19 15:21       ` Waiman Long
2021-04-19 16:18       ` Waiman Long
2021-04-19 16:18         ` Waiman Long
2021-04-19 17:13         ` Johannes Weiner
2021-04-19 17:13           ` Johannes Weiner
2021-04-19 17:19           ` Waiman Long
2021-04-19 17:19             ` Waiman Long
2021-04-19 17:26             ` Waiman Long
2021-04-19 17:26               ` Waiman Long
2021-04-19 21:11               ` Johannes Weiner
2021-04-19 21:11                 ` Johannes Weiner
2021-04-19 21:24                 ` Waiman Long
2021-04-19 21:24                   ` Waiman Long
2021-04-20  8:05                 ` Michal Hocko
2021-04-20  8:05                   ` Michal Hocko
2021-04-19 15:24   ` Shakeel Butt
2021-04-19 15:24     ` Shakeel Butt
2021-04-19 15:24     ` Shakeel Butt
2021-04-19  0:00 ` [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
2021-04-19 16:38   ` Johannes Weiner
2021-04-19 16:38     ` Johannes Weiner
2021-04-19 23:42     ` Waiman Long
2021-04-19 23:42       ` Waiman Long
2021-04-19  0:00 ` [PATCH v4 3/5] mm/memcg: Optimize user context object stock access Waiman Long
2021-04-19  0:00 ` [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock Waiman Long
2021-04-19  0:00   ` Waiman Long
2021-04-19 16:55   ` Johannes Weiner
2021-04-19 16:55     ` Johannes Weiner
2021-04-20 19:09     ` Waiman Long
2021-04-20 19:09       ` Waiman Long
2021-04-19  0:00 ` [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance Waiman Long
2021-04-19  0:00   ` Waiman Long
2021-04-19  6:06   ` [External] " Muchun Song
2021-04-19  6:06     ` Muchun Song
2021-04-19  6:06     ` Muchun Song
2021-04-19 15:00     ` Shakeel Butt
2021-04-19 15:00       ` Shakeel Butt
2021-04-19 15:00       ` Shakeel Butt
2021-04-19 15:19       ` Waiman Long
2021-04-19 15:19         ` Waiman Long
2021-04-19 15:56     ` Waiman Long
2021-04-19 15:56       ` Waiman Long
