[PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
@ 2021-04-12 22:54 ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:54 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1] and [2].

A simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module
init time is used for benchmarking. The test was run on a CascadeLake
server with turbo-boosting disable to reduce run-to-run variation.

With memory accounting disable, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---------------   --------   -------------------   -------------
       None          10.800s         7.952s              100.0%
        1-2           9.140s         6.292s               79.1%
        1-3           7.641s         4.793s               60.3%
        1-5           6.801s         3.953s               49.7%

Note that this is the best case scenario where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats. So the performance benefit
will be less.

It was found that a big part of the memory accounting overhead
was caused by the local_irq_save()/local_irq_restore() sequences in
updating local stock charge bytes and vmstat array, at least in x86
systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset to cache the vmstat data update in
the local stock as well which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 200 ++++++++++++++++++++++++++++++++-----
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 197 insertions(+), 58 deletions(-)

-- 
2.18.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
@ 2021-04-12 22:54 ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:54 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1] and [2].

A simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module
init time is used for benchmarking. The test was run on a CascadeLake
server with turbo-boosting disable to reduce run-to-run variation.

With memory accounting disable, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---------------   --------   -------------------   -------------
       None          10.800s         7.952s              100.0%
        1-2           9.140s         6.292s               79.1%
        1-3           7.641s         4.793s               60.3%
        1-5           6.801s         3.953s               49.7%

Note that this is the best case scenario where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats. So the performance benefit
will be less.

It was found that a big part of the memory accounting overhead
was caused by the local_irq_save()/local_irq_restore() sequences in
updating local stock charge bytes and vmstat array, at least in x86
systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset to cache the vmstat data update in
the local stock as well which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 200 ++++++++++++++++++++++++++++++++-----
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 197 insertions(+), 58 deletions(-)

-- 
2.18.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
@ 2021-04-12 22:54   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:54 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement of mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h | 12 +++++++-----
 mm/memcontrol.c            | 19 +++++++++++++------
 mm/slab.h                  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					  struct lruvec *lruvec,
 					  enum node_stat_item idx, int val)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__mod_memcg_lruvec_state(lruvec, idx, val);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, val);
 	local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					    struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
+	/* Update lruvec */
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+
+	if (!memcg)
+		memcg = pn->memcg;
 
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
-	/* Update lruvec */
+	if (!lruvec)
+		return;
+
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
 	if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
-		__mod_memcg_lruvec_state(lruvec, idx, val);
+		__mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
@ 2021-04-12 22:54   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:54 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement of mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/memcontrol.h | 12 +++++++-----
 mm/memcontrol.c            | 19 +++++++++++++------
 mm/slab.h                  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					  struct lruvec *lruvec,
 					  enum node_stat_item idx, int val)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__mod_memcg_lruvec_state(lruvec, idx, val);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, val);
 	local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					    struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
+	/* Update lruvec */
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+
+	if (!memcg)
+		memcg = pn->memcg;
 
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
-	/* Update lruvec */
+	if (!lruvec)
+		return;
+
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
 	if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
-		__mod_memcg_lruvec_state(lruvec, idx, val);
+		__mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function call goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 31 +++++++++++++++++++++++++++----
 mm/percpu.c                |  9 ++-------
 mm/slab.h                  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;
 
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function call goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 31 +++++++++++++++++++++++++++----
 mm/percpu.c                |  9 ++-------
 mm/slab.h                  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;
 
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reducing it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
 mm/slab.h       | 26 +++++++----------
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec = NULL;
 
 	local_irq_save(flags);
 	__refill_obj_stock(objcg, size);
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	if (pgdat)
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
-	rcu_read_unlock();
+	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
 	local_irq_restore(flags);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 677cdc52e641..ae971975d9fc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      gfp_t flags, size_t size,
@@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_objcgs(page)[off] = objcg;
-			mod_objcg_state(objcg, page_pgdat(page),
-					cache_vmstat_idx(s), obj_full_size(s));
+			mod_obj_stock_state(objcg, page_pgdat(page),
+					    cache_vmstat_idx(s),
+					    obj_full_size(s));
 		} else {
 			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
@@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
 					void **p, int objects)
 {
 }
+
+static void mod_obj_stock_state(struct obj_cgroup *objcg,
+				struct pglist_data *pgdat, int idx, int nr)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reducing it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
 mm/slab.h       | 26 +++++++----------
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec = NULL;
 
 	local_irq_save(flags);
 	__refill_obj_stock(objcg, size);
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	if (pgdat)
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
-	rcu_read_unlock();
+	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
 	local_irq_restore(flags);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 677cdc52e641..ae971975d9fc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      gfp_t flags, size_t size,
@@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_objcgs(page)[off] = objcg;
-			mod_objcg_state(objcg, page_pgdat(page),
-					cache_vmstat_idx(s), obj_full_size(s));
+			mod_obj_stock_state(objcg, page_pgdat(page),
+					    cache_vmstat_idx(s),
+					    obj_full_size(s));
 		} else {
 			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
@@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
 					void **p, int objects)
 {
 }
+
+static void mod_obj_stock_state(struct obj_cgroup *objcg,
+				struct pglist_data *pgdat, int idx, int nr)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-12 22:55   ` Waiman Long
  (?)
@ 2021-04-12 23:03     ` Shakeel Butt
  -1 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-12 23:03     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-12 23:03     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-12 22:55   ` Waiman Long
@ 2021-04-13 18:32     ` kernel test robot
  -1 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2021-04-13 18:32 UTC (permalink / raw)
  To: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Tejun Heo, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim
  Cc: kbuild-all, clang-built-linux, Linux Memory Management List

[-- Attachment #1: Type: text/plain, Size: 6066 bytes --]

Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on dennis-percpu/for-next]
[also build test WARNING on linus/master v5.12-rc7 next-20210413]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/mm-memcg-Reduce-kmemcache-memory-accounting-overhead/20210413-065659
base:   https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-next
config: mips-randconfig-r031-20210413 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/565fa80b02eeb06c48a88ac6354a4fc9e5addd06
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Waiman-Long/mm-memcg-Reduce-kmemcache-memory-accounting-overhead/20210413-065659
        git checkout 565fa80b02eeb06c48a88ac6354a4fc9e5addd06
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from mm/mempool.c:22:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state' [-Wunused-function]
   static void mod_obj_stock_state(struct obj_cgroup *objcg,
               ^
   1 warning generated.
--
   In file included from mm/oom_kill.c:49:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/bitops.h", .line = 105, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $1
   or $0, $2
   sc $0, $1
   beqz $0, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt
--
   In file included from mm/slab_common.c:33:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/atomic.h", .line = 153, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $1 # atomic_add
   addu $0, $2
   sc $0, $1
   beqz $0, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt
--
   In file included from mm/slub.c:20:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   mm/slub.c:1503:29: warning: unused function 'node_nr_slabs'
   static inline unsigned long node_nr_slabs(struct kmem_cache_node
   ^
   mm/slub.c:1521:21: warning: unused function 'kmalloc_large_node_hook'
   static inline void size_t size, gfp_t flags)
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/bitops.h", .line = 192, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $2
   or $1, $0, $3
   sc $1, $2
   beqz $1, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt


vim +/mod_obj_stock_state +401 mm/slab.h

   400	
 > 401	static void mod_obj_stock_state(struct obj_cgroup *objcg,
   402					struct pglist_data *pgdat, int idx, int nr)
   403	{
   404	}
   405	#endif /* CONFIG_MEMCG_KMEM */
   406	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29192 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
@ 2021-04-13 18:32     ` kernel test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2021-04-13 18:32 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6189 bytes --]

Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on dennis-percpu/for-next]
[also build test WARNING on linus/master v5.12-rc7 next-20210413]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Waiman-Long/mm-memcg-Reduce-kmemcache-memory-accounting-overhead/20210413-065659
base:   https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-next
config: mips-randconfig-r031-20210413 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/565fa80b02eeb06c48a88ac6354a4fc9e5addd06
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Waiman-Long/mm-memcg-Reduce-kmemcache-memory-accounting-overhead/20210413-065659
        git checkout 565fa80b02eeb06c48a88ac6354a4fc9e5addd06
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from mm/mempool.c:22:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state' [-Wunused-function]
   static void mod_obj_stock_state(struct obj_cgroup *objcg,
               ^
   1 warning generated.
--
   In file included from mm/oom_kill.c:49:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/bitops.h", .line = 105, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $1
   or $0, $2
   sc $0, $1
   beqz $0, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt
--
   In file included from mm/slab_common.c:33:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/atomic.h", .line = 153, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $1 # atomic_add
   addu $0, $2
   sc $0, $1
   beqz $0, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt
--
   In file included from mm/slub.c:20:
>> mm/slab.h:401:13: warning: unused function 'mod_obj_stock_state'
   static void mod_obj_stock_state(struct obj_cgroup
   ^
   mm/slub.c:1503:29: warning: unused function 'node_nr_slabs'
   static inline unsigned long node_nr_slabs(struct kmem_cache_node
   ^
   mm/slub.c:1521:21: warning: unused function 'kmalloc_large_node_hook'
   static inline void size_t size, gfp_t flags)
   ^
   fatal error: error in backend: Nested variants found in inline asm string: ' .set push
   .set mips64r2
   .if ( 0x00 ) != -1)) 0x00 ) != -1)) : ($( static struct ftrace_branch_data __attribute__((__aligned__(4))) __attribute__((__section__("_ftrace_branch"))) __if_trace = $( .func = __func__, .file = "arch/mips/include/asm/bitops.h", .line = 192, $); 0x00 ) != -1)) : $))) ) && ( 0 ); .set push; .set mips64r2; .rept 1; sync 0x00; .endr; .set pop; .else; ; .endif
   1: ll $0, $2
   or $1, $0, $3
   sc $1, $2
   beqz $1, 1b
   .set pop
   '
   clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
   clang version 13.0.0 (git://gitmirror/llvm_project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
   Target: mipsel-unknown-linux-gnu
   Thread model: posix
   InstalledDir: /opt/cross/clang-9829f5e6b1/bin
   clang-13: note: diagnostic msg:
   Makefile arch block certs crypto drivers fs include init ipc kernel lib mm scripts security sound source usr virt


vim +/mod_obj_stock_state +401 mm/slab.h

   400	
 > 401	static void mod_obj_stock_state(struct obj_cgroup *objcg,
   402					struct pglist_data *pgdat, int idx, int nr)
   403	{
   404	}
   405	#endif /* CONFIG_MEMCG_KMEM */
   406	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 29192 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

The object stock data stored in struct memcg_stock_pcp are independent
of the other page based data stored there. Separating them out into
their own struct to highlight the independency.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

The object stock data stored in struct memcg_stock_pcp are independent
of the other page based data stored there. Separating them out into
their own struct to highlight the independency.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
---
 mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-12 22:55   ` Waiman Long
  (?)
@ 2021-04-12 23:07     ` Shakeel Butt
  -1 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
@ 2021-04-12 23:07     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>

Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
@ 2021-04-12 23:07     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupted disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..29f2df76644a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)				\
+({							\
+	struct memcg_stock_pcp *stock;			\
+	struct obj_stock *obj_stock;			\
+							\
+	if (in_task()) {				\
+		preempt_disable();			\
+		(flags) = -1L;				\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+	}						\
+	obj_stock;					\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	stock = get_obj_stock(flags);
 
 	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
@@ -3254,8 +3294,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+	if (in_task() && stock->task_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+	if (stock->irq_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3283,9 +3328,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, nr_bytes);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
@@ -3325,9 +3370,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__mod_obj_stock_state(objcg, pgdat, idx, nr);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3380,10 +3425,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, size);
 	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 22:55   ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun, Waiman Long

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupted disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
---
 mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..29f2df76644a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)				\
+({							\
+	struct memcg_stock_pcp *stock;			\
+	struct obj_stock *obj_stock;			\
+							\
+	if (in_task()) {				\
+		preempt_disable();			\
+		(flags) = -1L;				\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+	}						\
+	obj_stock;					\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	stock = get_obj_stock(flags);
 
 	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
@@ -3254,8 +3294,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+	if (in_task() && stock->task_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+	if (stock->irq_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3283,9 +3328,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, nr_bytes);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
@@ -3325,9 +3370,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__mod_obj_stock_state(objcg, pgdat, idx, nr);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3380,10 +3425,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, size);
 	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
  2021-04-12 22:55   ` Waiman Long
  (?)
@ 2021-04-12 23:10     ` Shakeel Butt
  -1 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured amount of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
>
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
>
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
>
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupted disabled.
>
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 69f728383efe..29f2df76644a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>         struct mem_cgroup *cached; /* this never be root cgroup */
>         unsigned int nr_pages;
> -       struct obj_stock obj;
> +       struct obj_stock task_obj;
> +       struct obj_stock irq_obj;
>
>         struct work_struct work;
>         unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>
> -       return &stock->obj;
> +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags)                           \
> +({                                                     \
> +       struct memcg_stock_pcp *stock;                  \
> +       struct obj_stock *obj_stock;                    \
> +                                                       \
> +       if (in_task()) {                                \
> +               preempt_disable();                      \
> +               (flags) = -1L;                          \
> +               stock = this_cpu_ptr(&memcg_stock);     \

The above line was missing in the previous version.

> +               obj_stock = &stock->task_obj;           \
> +       } else {                                        \
> +               local_irq_save(flags);                  \
> +               stock = this_cpu_ptr(&memcg_stock);     \
> +               obj_stock = &stock->irq_obj;            \
> +       }                                               \
> +       obj_stock;                                      \
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +       if (flags == -1L)
> +               preempt_enable();
> +       else
> +               local_irq_restore(flags);
>  }
>
>  /**
> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>         local_irq_save(flags);
>
>         stock = this_cpu_ptr(&memcg_stock);
> -       drain_obj_stock(&stock->obj);
> +       drain_obj_stock(&stock->irq_obj);
> +       if (in_task())
> +               drain_obj_stock(&stock->task_obj);
>         drain_stock(stock);
>         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>         memcg = obj_cgroup_memcg(objcg);
>         if (pgdat)
>                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>         rcu_read_unlock();
>  }
>
> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>         unsigned long flags;
>         bool ret = false;
>
> -       local_irq_save(flags);
> +       stock = get_obj_stock(flags);
>
>         stock = current_obj_stock();

The above is redundant.

>         if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> @@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>                 ret = true;
>         }
>
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>
>         return ret;
>  }
> @@ -3254,8 +3294,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>         struct mem_cgroup *memcg;
>
> -       if (stock->obj.cached_objcg) {
> -               memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
> +       if (in_task() && stock->task_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> +               if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +                       return true;
> +       }
> +       if (stock->irq_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
>                 if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>                         return true;
>         }
> @@ -3283,9 +3328,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, nr_bytes);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> @@ -3325,9 +3370,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __mod_obj_stock_state(objcg, pgdat, idx, nr);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3380,10 +3425,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, size);
>         __mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 23:10     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured amount of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
>
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
>
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
>
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupted disabled.
>
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> ---
>  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 69f728383efe..29f2df76644a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>         struct mem_cgroup *cached; /* this never be root cgroup */
>         unsigned int nr_pages;
> -       struct obj_stock obj;
> +       struct obj_stock task_obj;
> +       struct obj_stock irq_obj;
>
>         struct work_struct work;
>         unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>
> -       return &stock->obj;
> +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags)                           \
> +({                                                     \
> +       struct memcg_stock_pcp *stock;                  \
> +       struct obj_stock *obj_stock;                    \
> +                                                       \
> +       if (in_task()) {                                \
> +               preempt_disable();                      \
> +               (flags) = -1L;                          \
> +               stock = this_cpu_ptr(&memcg_stock);     \

The above line was missing in the previous version.

> +               obj_stock = &stock->task_obj;           \
> +       } else {                                        \
> +               local_irq_save(flags);                  \
> +               stock = this_cpu_ptr(&memcg_stock);     \
> +               obj_stock = &stock->irq_obj;            \
> +       }                                               \
> +       obj_stock;                                      \
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +       if (flags == -1L)
> +               preempt_enable();
> +       else
> +               local_irq_restore(flags);
>  }
>
>  /**
> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>         local_irq_save(flags);
>
>         stock = this_cpu_ptr(&memcg_stock);
> -       drain_obj_stock(&stock->obj);
> +       drain_obj_stock(&stock->irq_obj);
> +       if (in_task())
> +               drain_obj_stock(&stock->task_obj);
>         drain_stock(stock);
>         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>         memcg = obj_cgroup_memcg(objcg);
>         if (pgdat)
>                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>         rcu_read_unlock();
>  }
>
> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>         unsigned long flags;
>         bool ret = false;
>
> -       local_irq_save(flags);
> +       stock = get_obj_stock(flags);
>
>         stock = current_obj_stock();

The above is redundant.

>         if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> @@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>                 ret = true;
>         }
>
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>
>         return ret;
>  }
> @@ -3254,8 +3294,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>         struct mem_cgroup *memcg;
>
> -       if (stock->obj.cached_objcg) {
> -               memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
> +       if (in_task() && stock->task_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> +               if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +                       return true;
> +       }
> +       if (stock->irq_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
>                 if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>                         return true;
>         }
> @@ -3283,9 +3328,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, nr_bytes);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> @@ -3325,9 +3370,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __mod_obj_stock_state(objcg, pgdat, idx, nr);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3380,10 +3425,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, size);
>         __mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 23:10     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured amount of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
>
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
>
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
>
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupted disabled.
>
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 69f728383efe..29f2df76644a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>         struct mem_cgroup *cached; /* this never be root cgroup */
>         unsigned int nr_pages;
> -       struct obj_stock obj;
> +       struct obj_stock task_obj;
> +       struct obj_stock irq_obj;
>
>         struct work_struct work;
>         unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>
> -       return &stock->obj;
> +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags)                           \
> +({                                                     \
> +       struct memcg_stock_pcp *stock;                  \
> +       struct obj_stock *obj_stock;                    \
> +                                                       \
> +       if (in_task()) {                                \
> +               preempt_disable();                      \
> +               (flags) = -1L;                          \
> +               stock = this_cpu_ptr(&memcg_stock);     \

The above line was missing in the previous version.

> +               obj_stock = &stock->task_obj;           \
> +       } else {                                        \
> +               local_irq_save(flags);                  \
> +               stock = this_cpu_ptr(&memcg_stock);     \
> +               obj_stock = &stock->irq_obj;            \
> +       }                                               \
> +       obj_stock;                                      \
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +       if (flags == -1L)
> +               preempt_enable();
> +       else
> +               local_irq_restore(flags);
>  }
>
>  /**
> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>         local_irq_save(flags);
>
>         stock = this_cpu_ptr(&memcg_stock);
> -       drain_obj_stock(&stock->obj);
> +       drain_obj_stock(&stock->irq_obj);
> +       if (in_task())
> +               drain_obj_stock(&stock->task_obj);
>         drain_stock(stock);
>         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>         memcg = obj_cgroup_memcg(objcg);
>         if (pgdat)
>                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>         rcu_read_unlock();
>  }
>
> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>         unsigned long flags;
>         bool ret = false;
>
> -       local_irq_save(flags);
> +       stock = get_obj_stock(flags);
>
>         stock = current_obj_stock();

The above is redundant.

>         if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> @@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>                 ret = true;
>         }
>
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>
>         return ret;
>  }
> @@ -3254,8 +3294,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>         struct mem_cgroup *memcg;
>
> -       if (stock->obj.cached_objcg) {
> -               memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
> +       if (in_task() && stock->task_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> +               if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +                       return true;
> +       }
> +       if (stock->irq_obj.cached_objcg) {
> +               memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
>                 if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>                         return true;
>         }
> @@ -3283,9 +3328,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, nr_bytes);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> @@ -3325,9 +3370,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __mod_obj_stock_state(objcg, pgdat, idx, nr);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3380,10 +3425,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>  {
>         unsigned long flags;
>
> -       local_irq_save(flags);
> +       get_obj_stock(flags);
>         __refill_obj_stock(objcg, size);
>         __mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
> -       local_irq_restore(flags);
> +       put_obj_stock(flags);
>  }
>
>  #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
  2021-04-12 23:10     ` Shakeel Butt
  (?)
@ 2021-04-12 23:14       ` Shakeel Butt
  -1 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 4:10 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
> >
> > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > enabled, the measured amount of kmem_cache_alloc() calls from non-task
> > context was about 0.01% of the total.
> >
> > The irq disable/enable sequence used in this case to access content
> > from object stock is slow.  To optimize for user context access, there
> > are now two object stocks for task context and interrupt context access
> > respectively.
> >
> > The task context object stock can be accessed after disabling preemption
> > which is cheap in non-preempt kernel. The interrupt context object stock
> > can only be accessed after disabling interrupt. User context code can
> > access interrupt object stock, but not vice versa.
> >
> > The mod_objcg_state() function is also modified to make sure that memcg
> > and lruvec stat updates are done with interrupted disabled.
> >
> > The downside of this change is that there are more data stored in local
> > object stocks and not reflected in the charge counter and the vmstat
> > arrays.  However, this is a small price to pay for better performance.
> >
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 59 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 69f728383efe..29f2df76644a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2229,7 +2229,8 @@ struct obj_stock {
> >  struct memcg_stock_pcp {
> >         struct mem_cgroup *cached; /* this never be root cgroup */
> >         unsigned int nr_pages;
> > -       struct obj_stock obj;
> > +       struct obj_stock task_obj;
> > +       struct obj_stock irq_obj;
> >
> >         struct work_struct work;
> >         unsigned long flags;
> > @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> >  }
> >  #endif
> >
> > +/*
> > + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> > + * sequence used in this case to access content from object stock is slow.
> > + * To optimize for user context access, there are now two object stocks for
> > + * task context and interrupt context access respectively.
> > + *
> > + * The task context object stock can be accessed by disabling preemption only
> > + * which is cheap in non-preempt kernel. The interrupt context object stock
> > + * can only be accessed after disabling interrupt. User context code can
> > + * access interrupt object stock, but not vice versa.
> > + */
> >  static inline struct obj_stock *current_obj_stock(void)
> >  {
> >         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> >
> > -       return &stock->obj;
> > +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> > +}
> > +
> > +#define get_obj_stock(flags)                           \
> > +({                                                     \
> > +       struct memcg_stock_pcp *stock;                  \
> > +       struct obj_stock *obj_stock;                    \
> > +                                                       \
> > +       if (in_task()) {                                \
> > +               preempt_disable();                      \
> > +               (flags) = -1L;                          \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
>
> The above line was missing in the previous version.
>
> > +               obj_stock = &stock->task_obj;           \
> > +       } else {                                        \
> > +               local_irq_save(flags);                  \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
> > +               obj_stock = &stock->irq_obj;            \
> > +       }                                               \
> > +       obj_stock;                                      \
> > +})
> > +
> > +static inline void put_obj_stock(unsigned long flags)
> > +{
> > +       if (flags == -1L)
> > +               preempt_enable();
> > +       else
> > +               local_irq_restore(flags);
> >  }
> >
> >  /**
> > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
> >         local_irq_save(flags);
> >
> >         stock = this_cpu_ptr(&memcg_stock);
> > -       drain_obj_stock(&stock->obj);
> > +       drain_obj_stock(&stock->irq_obj);
> > +       if (in_task())
> > +               drain_obj_stock(&stock->task_obj);
> >         drain_stock(stock);
> >         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> >
> > @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
> >         memcg = obj_cgroup_memcg(objcg);
> >         if (pgdat)
> >                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> >         rcu_read_unlock();
> >  }
> >
> > @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> >         unsigned long flags;
> >         bool ret = false;
> >
> > -       local_irq_save(flags);
> > +       stock = get_obj_stock(flags);
> >
> >         stock = current_obj_stock();
>
> The above is redundant.
>

After cleanup you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 23:14       ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 4:10 PM Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > enabled, the measured amount of kmem_cache_alloc() calls from non-task
> > context was about 0.01% of the total.
> >
> > The irq disable/enable sequence used in this case to access content
> > from object stock is slow.  To optimize for user context access, there
> > are now two object stocks for task context and interrupt context access
> > respectively.
> >
> > The task context object stock can be accessed after disabling preemption
> > which is cheap in non-preempt kernel. The interrupt context object stock
> > can only be accessed after disabling interrupt. User context code can
> > access interrupt object stock, but not vice versa.
> >
> > The mod_objcg_state() function is also modified to make sure that memcg
> > and lruvec stat updates are done with interrupted disabled.
> >
> > The downside of this change is that there are more data stored in local
> > object stocks and not reflected in the charge counter and the vmstat
> > arrays.  However, this is a small price to pay for better performance.
> >
> > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > ---
> >  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 59 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 69f728383efe..29f2df76644a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2229,7 +2229,8 @@ struct obj_stock {
> >  struct memcg_stock_pcp {
> >         struct mem_cgroup *cached; /* this never be root cgroup */
> >         unsigned int nr_pages;
> > -       struct obj_stock obj;
> > +       struct obj_stock task_obj;
> > +       struct obj_stock irq_obj;
> >
> >         struct work_struct work;
> >         unsigned long flags;
> > @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> >  }
> >  #endif
> >
> > +/*
> > + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> > + * sequence used in this case to access content from object stock is slow.
> > + * To optimize for user context access, there are now two object stocks for
> > + * task context and interrupt context access respectively.
> > + *
> > + * The task context object stock can be accessed by disabling preemption only
> > + * which is cheap in non-preempt kernel. The interrupt context object stock
> > + * can only be accessed after disabling interrupt. User context code can
> > + * access interrupt object stock, but not vice versa.
> > + */
> >  static inline struct obj_stock *current_obj_stock(void)
> >  {
> >         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> >
> > -       return &stock->obj;
> > +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> > +}
> > +
> > +#define get_obj_stock(flags)                           \
> > +({                                                     \
> > +       struct memcg_stock_pcp *stock;                  \
> > +       struct obj_stock *obj_stock;                    \
> > +                                                       \
> > +       if (in_task()) {                                \
> > +               preempt_disable();                      \
> > +               (flags) = -1L;                          \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
>
> The above line was missing in the previous version.
>
> > +               obj_stock = &stock->task_obj;           \
> > +       } else {                                        \
> > +               local_irq_save(flags);                  \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
> > +               obj_stock = &stock->irq_obj;            \
> > +       }                                               \
> > +       obj_stock;                                      \
> > +})
> > +
> > +static inline void put_obj_stock(unsigned long flags)
> > +{
> > +       if (flags == -1L)
> > +               preempt_enable();
> > +       else
> > +               local_irq_restore(flags);
> >  }
> >
> >  /**
> > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
> >         local_irq_save(flags);
> >
> >         stock = this_cpu_ptr(&memcg_stock);
> > -       drain_obj_stock(&stock->obj);
> > +       drain_obj_stock(&stock->irq_obj);
> > +       if (in_task())
> > +               drain_obj_stock(&stock->task_obj);
> >         drain_stock(stock);
> >         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> >
> > @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
> >         memcg = obj_cgroup_memcg(objcg);
> >         if (pgdat)
> >                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> >         rcu_read_unlock();
> >  }
> >
> > @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> >         unsigned long flags;
> >         bool ret = false;
> >
> > -       local_irq_save(flags);
> > +       stock = get_obj_stock(flags);
> >
> >         stock = current_obj_stock();
>
> The above is redundant.
>

After cleanup you can add:

Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-12 23:14       ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2021-04-12 23:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Mon, Apr 12, 2021 at 4:10 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
> >
> > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > enabled, the measured amount of kmem_cache_alloc() calls from non-task
> > context was about 0.01% of the total.
> >
> > The irq disable/enable sequence used in this case to access content
> > from object stock is slow.  To optimize for user context access, there
> > are now two object stocks for task context and interrupt context access
> > respectively.
> >
> > The task context object stock can be accessed after disabling preemption
> > which is cheap in non-preempt kernel. The interrupt context object stock
> > can only be accessed after disabling interrupt. User context code can
> > access interrupt object stock, but not vice versa.
> >
> > The mod_objcg_state() function is also modified to make sure that memcg
> > and lruvec stat updates are done with interrupted disabled.
> >
> > The downside of this change is that there are more data stored in local
> > object stocks and not reflected in the charge counter and the vmstat
> > arrays.  However, this is a small price to pay for better performance.
> >
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 59 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 69f728383efe..29f2df76644a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2229,7 +2229,8 @@ struct obj_stock {
> >  struct memcg_stock_pcp {
> >         struct mem_cgroup *cached; /* this never be root cgroup */
> >         unsigned int nr_pages;
> > -       struct obj_stock obj;
> > +       struct obj_stock task_obj;
> > +       struct obj_stock irq_obj;
> >
> >         struct work_struct work;
> >         unsigned long flags;
> > @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> >  }
> >  #endif
> >
> > +/*
> > + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> > + * sequence used in this case to access content from object stock is slow.
> > + * To optimize for user context access, there are now two object stocks for
> > + * task context and interrupt context access respectively.
> > + *
> > + * The task context object stock can be accessed by disabling preemption only
> > + * which is cheap in non-preempt kernel. The interrupt context object stock
> > + * can only be accessed after disabling interrupt. User context code can
> > + * access interrupt object stock, but not vice versa.
> > + */
> >  static inline struct obj_stock *current_obj_stock(void)
> >  {
> >         struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> >
> > -       return &stock->obj;
> > +       return in_task() ? &stock->task_obj : &stock->irq_obj;
> > +}
> > +
> > +#define get_obj_stock(flags)                           \
> > +({                                                     \
> > +       struct memcg_stock_pcp *stock;                  \
> > +       struct obj_stock *obj_stock;                    \
> > +                                                       \
> > +       if (in_task()) {                                \
> > +               preempt_disable();                      \
> > +               (flags) = -1L;                          \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
>
> The above line was missing in the previous version.
>
> > +               obj_stock = &stock->task_obj;           \
> > +       } else {                                        \
> > +               local_irq_save(flags);                  \
> > +               stock = this_cpu_ptr(&memcg_stock);     \
> > +               obj_stock = &stock->irq_obj;            \
> > +       }                                               \
> > +       obj_stock;                                      \
> > +})
> > +
> > +static inline void put_obj_stock(unsigned long flags)
> > +{
> > +       if (flags == -1L)
> > +               preempt_enable();
> > +       else
> > +               local_irq_restore(flags);
> >  }
> >
> >  /**
> > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
> >         local_irq_save(flags);
> >
> >         stock = this_cpu_ptr(&memcg_stock);
> > -       drain_obj_stock(&stock->obj);
> > +       drain_obj_stock(&stock->irq_obj);
> > +       if (in_task())
> > +               drain_obj_stock(&stock->task_obj);
> >         drain_stock(stock);
> >         clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> >
> > @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
> >         memcg = obj_cgroup_memcg(objcg);
> >         if (pgdat)
> >                 lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> >         rcu_read_unlock();
> >  }
> >
> > @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> >         unsigned long flags;
> >         bool ret = false;
> >
> > -       local_irq_save(flags);
> > +       stock = get_obj_stock(flags);
> >
> >         stock = current_obj_stock();
>
> The above is redundant.
>

After cleanup you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-13  1:03       ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-13  1:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/12/21 7:10 PM, Shakeel Butt wrote:
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman@redhat.com> wrote:
>> Most kmem_cache_alloc() calls are from user context. With instrumentation
>> enabled, the measured amount of kmem_cache_alloc() calls from non-task
>> context was about 0.01% of the total.
>>
>> The irq disable/enable sequence used in this case to access content
>> from object stock is slow.  To optimize for user context access, there
>> are now two object stocks for task context and interrupt context access
>> respectively.
>>
>> The task context object stock can be accessed after disabling preemption
>> which is cheap in non-preempt kernel. The interrupt context object stock
>> can only be accessed after disabling interrupt. User context code can
>> access interrupt object stock, but not vice versa.
>>
>> The mod_objcg_state() function is also modified to make sure that memcg
>> and lruvec stat updates are done with interrupted disabled.
>>
>> The downside of this change is that there are more data stored in local
>> object stocks and not reflected in the charge counter and the vmstat
>> arrays.  However, this is a small price to pay for better performance.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: Roman Gushchin <guro@fb.com>
>> ---
>>   mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 69f728383efe..29f2df76644a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2229,7 +2229,8 @@ struct obj_stock {
>>   struct memcg_stock_pcp {
>>          struct mem_cgroup *cached; /* this never be root cgroup */
>>          unsigned int nr_pages;
>> -       struct obj_stock obj;
>> +       struct obj_stock task_obj;
>> +       struct obj_stock irq_obj;
>>
>>          struct work_struct work;
>>          unsigned long flags;
>> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>   }
>>   #endif
>>
>> +/*
>> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
>> + * sequence used in this case to access content from object stock is slow.
>> + * To optimize for user context access, there are now two object stocks for
>> + * task context and interrupt context access respectively.
>> + *
>> + * The task context object stock can be accessed by disabling preemption only
>> + * which is cheap in non-preempt kernel. The interrupt context object stock
>> + * can only be accessed after disabling interrupt. User context code can
>> + * access interrupt object stock, but not vice versa.
>> + */
>>   static inline struct obj_stock *current_obj_stock(void)
>>   {
>>          struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>>
>> -       return &stock->obj;
>> +       return in_task() ? &stock->task_obj : &stock->irq_obj;
>> +}
>> +
>> +#define get_obj_stock(flags)                           \
>> +({                                                     \
>> +       struct memcg_stock_pcp *stock;                  \
>> +       struct obj_stock *obj_stock;                    \
>> +                                                       \
>> +       if (in_task()) {                                \
>> +               preempt_disable();                      \
>> +               (flags) = -1L;                          \
>> +               stock = this_cpu_ptr(&memcg_stock);     \
> The above line was missing in the previous version.
>
>> +               obj_stock = &stock->task_obj;           \
>> +       } else {                                        \
>> +               local_irq_save(flags);                  \
>> +               stock = this_cpu_ptr(&memcg_stock);     \
>> +               obj_stock = &stock->irq_obj;            \
>> +       }                                               \
>> +       obj_stock;                                      \
>> +})
>> +
>> +static inline void put_obj_stock(unsigned long flags)
>> +{
>> +       if (flags == -1L)
>> +               preempt_enable();
>> +       else
>> +               local_irq_restore(flags);
>>   }
>>
>>   /**
>> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>>          local_irq_save(flags);
>>
>>          stock = this_cpu_ptr(&memcg_stock);
>> -       drain_obj_stock(&stock->obj);
>> +       drain_obj_stock(&stock->irq_obj);
>> +       if (in_task())
>> +               drain_obj_stock(&stock->task_obj);
>>          drain_stock(stock);
>>          clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>>
>> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>>          memcg = obj_cgroup_memcg(objcg);
>>          if (pgdat)
>>                  lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>> +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>>          rcu_read_unlock();
>>   }
>>
>> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>>          unsigned long flags;
>>          bool ret = false;
>>
>> -       local_irq_save(flags);
>> +       stock = get_obj_stock(flags);
>>
>>          stock = current_obj_stock();
> The above is redundant.

Right. I should check the patch carefully. Will remove it.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
@ 2021-04-13  1:03       ` Waiman Long
  0 siblings, 0 replies; 28+ messages in thread
From: Waiman Long @ 2021-04-13  1:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/12/21 7:10 PM, Shakeel Butt wrote:
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> Most kmem_cache_alloc() calls are from user context. With instrumentation
>> enabled, the measured amount of kmem_cache_alloc() calls from non-task
>> context was about 0.01% of the total.
>>
>> The irq disable/enable sequence used in this case to access content
>> from object stock is slow.  To optimize for user context access, there
>> are now two object stocks for task context and interrupt context access
>> respectively.
>>
>> The task context object stock can be accessed after disabling preemption
>> which is cheap in non-preempt kernel. The interrupt context object stock
>> can only be accessed after disabling interrupt. User context code can
>> access interrupt object stock, but not vice versa.
>>
>> The mod_objcg_state() function is also modified to make sure that memcg
>> and lruvec stat updates are done with interrupted disabled.
>>
>> The downside of this change is that there are more data stored in local
>> object stocks and not reflected in the charge counter and the vmstat
>> arrays.  However, this is a small price to pay for better performance.
>>
>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
>> ---
>>   mm/memcontrol.c | 73 +++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 69f728383efe..29f2df76644a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2229,7 +2229,8 @@ struct obj_stock {
>>   struct memcg_stock_pcp {
>>          struct mem_cgroup *cached; /* this never be root cgroup */
>>          unsigned int nr_pages;
>> -       struct obj_stock obj;
>> +       struct obj_stock task_obj;
>> +       struct obj_stock irq_obj;
>>
>>          struct work_struct work;
>>          unsigned long flags;
>> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>   }
>>   #endif
>>
>> +/*
>> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
>> + * sequence used in this case to access content from object stock is slow.
>> + * To optimize for user context access, there are now two object stocks for
>> + * task context and interrupt context access respectively.
>> + *
>> + * The task context object stock can be accessed by disabling preemption only
>> + * which is cheap in non-preempt kernel. The interrupt context object stock
>> + * can only be accessed after disabling interrupt. User context code can
>> + * access interrupt object stock, but not vice versa.
>> + */
>>   static inline struct obj_stock *current_obj_stock(void)
>>   {
>>          struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>>
>> -       return &stock->obj;
>> +       return in_task() ? &stock->task_obj : &stock->irq_obj;
>> +}
>> +
>> +#define get_obj_stock(flags)                           \
>> +({                                                     \
>> +       struct memcg_stock_pcp *stock;                  \
>> +       struct obj_stock *obj_stock;                    \
>> +                                                       \
>> +       if (in_task()) {                                \
>> +               preempt_disable();                      \
>> +               (flags) = -1L;                          \
>> +               stock = this_cpu_ptr(&memcg_stock);     \
> The above line was missing in the previous version.
>
>> +               obj_stock = &stock->task_obj;           \
>> +       } else {                                        \
>> +               local_irq_save(flags);                  \
>> +               stock = this_cpu_ptr(&memcg_stock);     \
>> +               obj_stock = &stock->irq_obj;            \
>> +       }                                               \
>> +       obj_stock;                                      \
>> +})
>> +
>> +static inline void put_obj_stock(unsigned long flags)
>> +{
>> +       if (flags == -1L)
>> +               preempt_enable();
>> +       else
>> +               local_irq_restore(flags);
>>   }
>>
>>   /**
>> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>>          local_irq_save(flags);
>>
>>          stock = this_cpu_ptr(&memcg_stock);
>> -       drain_obj_stock(&stock->obj);
>> +       drain_obj_stock(&stock->irq_obj);
>> +       if (in_task())
>> +               drain_obj_stock(&stock->task_obj);
>>          drain_stock(stock);
>>          clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>>
>> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>>          memcg = obj_cgroup_memcg(objcg);
>>          if (pgdat)
>>                  lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> -       __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>> +       mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>>          rcu_read_unlock();
>>   }
>>
>> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>>          unsigned long flags;
>>          bool ret = false;
>>
>> -       local_irq_save(flags);
>> +       stock = get_obj_stock(flags);
>>
>>          stock = current_obj_stock();
> The above is redundant.

Right. I should check the patch carefully. Will remove it.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-04-13 18:33 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12 22:54 [PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
2021-04-12 22:54 ` Waiman Long
2021-04-12 22:54 ` [PATCH v2 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
2021-04-12 22:54   ` Waiman Long
2021-04-12 22:55 ` [PATCH v2 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
2021-04-12 22:55   ` Waiman Long
2021-04-12 22:55 ` [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
2021-04-12 22:55   ` Waiman Long
2021-04-12 23:03   ` Shakeel Butt
2021-04-12 23:03     ` Shakeel Butt
2021-04-12 23:03     ` Shakeel Butt
2021-04-13 18:32   ` kernel test robot
2021-04-13 18:32     ` kernel test robot
2021-04-12 22:55 ` [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
2021-04-12 22:55   ` Waiman Long
2021-04-12 23:07   ` Shakeel Butt
2021-04-12 23:07     ` Shakeel Butt
2021-04-12 23:07     ` Shakeel Butt
2021-04-12 22:55 ` [PATCH v2 5/5] mm/memcg: Optimize user context object stock access Waiman Long
2021-04-12 22:55   ` Waiman Long
2021-04-12 23:10   ` Shakeel Butt
2021-04-12 23:10     ` Shakeel Butt
2021-04-12 23:10     ` Shakeel Butt
2021-04-12 23:14     ` Shakeel Butt
2021-04-12 23:14       ` Shakeel Butt
2021-04-12 23:14       ` Shakeel Butt
2021-04-13  1:03     ` Waiman Long
2021-04-13  1:03       ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.