* [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
@ 2021-04-14  1:20 Waiman Long
  2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

 v3:
  - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
    in patch 3.
  - Remove redundant current_obj_stock() call in patch 5.

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and
de-allocations may therefore experience a performance regression, as
illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
module init time is used for benchmarking. The test was run on a
CascadeLake server with turbo-boosting disabled to reduce run-to-run
variation.
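
The test module itself was not posted; a minimal sketch of what such a
benchmark module could look like (module and cache names are
assumptions, mirroring the loop Masayoshi posted later in this thread)
is:

  #include <linux/module.h>
  #include <linux/slab.h>

  struct dummy {
	char pad[64];		/* 64-byte test object */
  };

  static int __init bench_init(void)
  {
	struct kmem_cache *dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
	struct dummy *p;
	long i;

	if (!dummy_cache)
		return -ENOMEM;

	/* Repeated alloc/free pairs exercise the accounting hot path */
	for (i = 0; i < 100000000L; i++) {
		p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
		if (!p)
			break;
		kmem_cache_free(dummy_cache, p);
	}
	kmem_cache_destroy(dummy_cache);
	return 0;
  }
  module_init(bench_init);
  MODULE_LICENSE("GPL");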

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with various subsets of the patchset
applied were as follows (accounting overhead = run time - 2.848s):

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---------------   --------   -------------------   -------------
       None          10.800s         7.952s              100.0%
        1-2           9.140s         6.292s               79.1%
        1-3           7.641s         4.793s               60.3%
        1-5           6.801s         3.953s               49.7%

Note that this is the best-case scenario, where most updates happen
only to the percpu stocks. Real workloads will likely have a certain
amount of updates to the memcg charges and vmstats, so the performance
benefit will be smaller.

It was found that a big part of the memory accounting overhead was
caused by the local_irq_save()/local_irq_restore() sequences used to
update the local stock charge bytes and the vmstat array, at least on
x86 systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset reduces the use of such sequences
as much as possible; in fact, it eliminates them in the common case.
Another part of this patchset caches the vmstat data updates in the
local stock as well, which also helps.
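
For reference, the hot-path pattern being minimized is of the form
below (a simplified sketch; the actual sequences appear in the diffs
that follow):

	unsigned long flags;

	local_irq_save(flags);		/* flag save + cli, costly on x86 */
	/* update percpu stock charge bytes or vmstat counter */
	local_irq_restore(flags);	/* popf, costly on x86 */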

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 199 ++++++++++++++++++++++++++++++++-----
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 196 insertions(+), 58 deletions(-)

-- 
2.18.1



* [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
@ 2021-04-14  1:20 ` Waiman Long
  2021-04-15  3:27   ` Masayoshi Mizuma
  2021-04-15 16:40   ` Johannes Weiner
  2021-04-14  1:20 ` [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

The caller of mod_memcg_lruvec_state() has both memcg and lruvec
readily available, so both of them are now passed to
mod_memcg_lruvec_state() and __mod_memcg_lruvec_state().
__mod_memcg_lruvec_state() is updated to allow either of the two
parameters to be NULL. This makes mod_memcg_lruvec_state() equivalent
to mod_memcg_state() when lruvec is NULL.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement for mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h | 12 +++++++-----
 mm/memcontrol.c            | 19 +++++++++++++------
 mm/slab.h                  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					  struct lruvec *lruvec,
 					  enum node_stat_item idx, int val)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__mod_memcg_lruvec_state(lruvec, idx, val);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, val);
 	local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					    struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
+	/* Update lruvec */
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+
+	if (!memcg)
+		memcg = pn->memcg;
 
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
-	/* Update lruvec */
+	if (!lruvec)
+		return;
+
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
 	if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
-		__mod_memcg_lruvec_state(lruvec, idx, val);
+		__mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
-- 
2.18.1



* [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
  2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
@ 2021-04-14  1:20 ` Waiman Long
  2021-04-15  3:27   ` Masayoshi Mizuma
  2021-04-15 16:30   ` Johannes Weiner
  2021-04-14  1:20 ` [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle,
which is inefficient. Introduce a new function
obj_cgroup_uncharge_mod_state() that combines them with a single
irq_save/irq_restore cycle.
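
For illustration, the resulting caller-side change in the
kmem_cache_free() path (taken from the mm/slab.h hunk below) is:

	/* Before: two separate irq_save/irq_restore cycles */
	obj_cgroup_uncharge(objcg, obj_full_size(s));
	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
			-obj_full_size(s));

	/* After: a single combined cycle */
	obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
				      page_pgdat(page),
				      cache_vmstat_idx(s));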

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 31 +++++++++++++++++++++++++++----
 mm/percpu.c                |  9 ++-------
 mm/slab.h                  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;
 
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
2.18.1



* [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
  2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
  2021-04-14  1:20 ` [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
@ 2021-04-14  1:20 ` Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15 16:50   ` Johannes Weiner
  2021-04-14  1:20 ` [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

Before the new slab memory controller with per-object byte charging,
charging and vmstat data updates happened only when new slab pages
were allocated or freed. Now they are done with every
kmem_cache_alloc() and kmem_cache_free(). This causes additional
overhead for workloads that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charges for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
caches the vmstat data in the memcg_stock_pcp structure as well, until
it accumulates a page size worth of updates or until other cached data
change.

On a 2-socket Cascade Lake server with instrumentation enabled and
this patch applied, it was found that after initial boot only about
17% (946796 out of 5515184) of the calls to __mod_obj_stock_state()
led to an actual call to mod_objcg_state(). When doing a parallel
kernel build, the figure was about 16% (21894614 out of 139780628). So
caching the vmstat data reduces the number of calls to
mod_objcg_state() by more than 80%.
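
Summarized, the per-cpu caching decision made in
__mod_obj_stock_state() (see the hunk below) is:

	/*
	 *  - stock caches a different objcg -> pass the update through as is
	 *  - nothing cached yet             -> save (nr, idx, pgdat), done
	 *  - different (pgdat, idx) cached  -> flush the cached value and
	 *                                      cache the new one instead
	 *  - same (pgdat, idx) cached       -> accumulate into vmstat_bytes;
	 *                                      flush the total when the
	 *                                      incoming update exceeds
	 *                                      PAGE_SIZE
	 */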

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
 mm/slab.h       | 26 +++++++----------
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec = NULL;
 
 	local_irq_save(flags);
 	__refill_obj_stock(objcg, size);
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	if (pgdat)
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
-	rcu_read_unlock();
+	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
 	local_irq_restore(flags);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 677cdc52e641..03bd9813422b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      gfp_t flags, size_t size,
@@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_objcgs(page)[off] = objcg;
-			mod_objcg_state(objcg, page_pgdat(page),
-					cache_vmstat_idx(s), obj_full_size(s));
+			mod_obj_stock_state(objcg, page_pgdat(page),
+					    cache_vmstat_idx(s),
+					    obj_full_size(s));
 		} else {
 			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
@@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
 					void **p, int objects)
 {
 }
+
+static inline void mod_obj_stock_state(struct obj_cgroup *objcg,
+				       struct pglist_data *pgdat, int idx, int nr)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
-- 
2.18.1



* [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (2 preceding siblings ...)
  2021-04-14  1:20 ` [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
@ 2021-04-14  1:20 ` Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15 16:57   ` Johannes Weiner
  2021-04-14  1:20 ` [PATCH v3 5/5] mm/memcg: Optimize user context object stock access Waiman Long
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

The object stock data stored in struct memcg_stock_pcp are independent
of the other page-based data stored there. Separate them out into
their own struct to highlight their independence.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1



* [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (3 preceding siblings ...)
  2021-04-14  1:20 ` [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
@ 2021-04-14  1:20 ` Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15 17:53   ` Johannes Weiner
  2021-04-15  3:26 ` [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Masayoshi Mizuma
  2021-04-15 17:10 ` Matthew Wilcox
  6 siblings, 2 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-14  1:20 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun, Waiman Long

Most kmem_cache_alloc() calls are from user context. With
instrumentation enabled, the measured fraction of kmem_cache_alloc()
calls from non-task context was only about 0.01% of the total.

The irq disable/enable sequence used in this case to access the object
stock is slow. To optimize for user context access, there are now two
object stocks, for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling
preemption, which is cheap in a non-preempt kernel. The interrupt
context object stock can only be accessed after disabling interrupts.
User context code can access the interrupt object stock, but not vice
versa.

The mod_objcg_state() function is also modified to make sure that
memcg and lruvec stat updates are done with interrupts disabled.

The downside of this change is that more data are stored in the local
object stocks without being reflected in the charge counter and the
vmstat arrays. However, this is a small price to pay for better
performance.
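
For illustration, the get/put pair introduced below hides the two
cases from callers, which then follow this pattern (taken from
refill_obj_stock() in the diff):

	unsigned long flags;

	get_obj_stock(flags);	/* preempt_disable() in task context,
				 * local_irq_save() in interrupt context */
	__refill_obj_stock(objcg, nr_bytes);
	put_obj_stock(flags);	/* preempt_enable() or local_irq_restore() */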

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..8875e896e52b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)				\
+({							\
+	struct memcg_stock_pcp *stock;			\
+	struct obj_stock *obj_stock;			\
+							\
+	if (in_task()) {				\
+		preempt_disable();			\
+		(flags) = -1L;				\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+	}						\
+	obj_stock;					\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3193,15 +3233,14 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	stock = get_obj_stock(flags);
 
-	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
@@ -3254,8 +3293,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+	if (in_task() && stock->task_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+	if (stock->irq_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3283,9 +3327,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, nr_bytes);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
@@ -3325,9 +3369,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__mod_obj_stock_state(objcg, pgdat, idx, nr);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3380,10 +3424,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, size);
 	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1



* Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (4 preceding siblings ...)
  2021-04-14  1:20 ` [PATCH v3 5/5] mm/memcg: Optimize user context object stock access Waiman Long
@ 2021-04-15  3:26 ` Masayoshi Mizuma
  2021-04-15 13:17   ` Waiman Long
  2021-04-15 17:10 ` Matthew Wilcox
  6 siblings, 1 reply; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
>  v3:
>   - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
>     in patch 3.
>   - Remove redundant current_obj_stock() call in patch 5.
> 
>  v2:
>   - Fix bug found by test robot in patch 5.
>   - Update cover letter and commit logs.
> 
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
> 
> For workloads that require a lot of kmemcache allocations and
> de-allocations, they may experience performance regression as illustrated
> in [1] and [2].
> 
> A simple kernel module that performs repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module
> init time is used for benchmarking. The test was run on a CascadeLake
> server with turbo-boosting disable to reduce run-to-run variation.
> 
> With memory accounting disable, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
> 
>   Applied patches   Run time   Accounting overhead   Overhead %age
>   ---------------   --------   -------------------   -------------
>        None          10.800s         7.952s              100.0%
>         1-2           9.140s         6.292s               79.1%
>         1-3           7.641s         4.793s               60.3%
>         1-5           6.801s         3.953s               49.7%
> 
> Note that this is the best case scenario where most updates happen only
> to the percpu stocks. Real workloads will likely have a certain amount
> of updates to the memcg charges and vmstats. So the performance benefit
> will be less.
> 
> It was found that a big part of the memory accounting overhead
> was caused by the local_irq_save()/local_irq_restore() sequences in
> updating local stock charge bytes and vmstat array, at least in x86
> systems. There are two such sequences in kmem_cache_alloc() and two
> in kmem_cache_free(). This patchset tries to reduce the use of such
> sequences as much as possible. In fact, it eliminates them in the common
> case. Another part of this patchset to cache the vmstat data update in
> the local stock as well which also helps.
> 
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Hi Longman,

Thank you for your patches.
I reran the benchmark with your patches, and it seems that the
reduction is small... The total durations of the sendto() and
recvfrom() system calls during the benchmark are as follows.

- sendto
  - v5.8 vanilla:                      2576.056 msec (100%)
  - v5.12-rc7 vanilla:                 2988.911 msec (116%)
  - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)

- recvfrom
  - v5.8 vanilla:                      2113.156 msec (100%)
  - v5.12-rc7 vanilla:                 2305.810 msec (109%)
  - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)

kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times
during the benchmark. I ran a loop in a kernel module as follows; its
duration is indeed reduced by your patches.

  ---
  dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
  for (i = 0; i < 1400000; i++) {
	p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
	kmem_cache_free(dummy_cache, p);
  }
  ---

- v5.12-rc7 vanilla:                 110 msec (100%)
- v5.12-rc7 with your patches (1-5):  85 msec (77%)

It seems that the reduction is small for the benchmark itself, though...
Anyway, I can see that your patches reduce the overhead.
Please feel free to add:

	Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa



* Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
@ 2021-04-15  3:27   ` Masayoshi Mizuma
  2021-04-15 16:40   ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.
> 
> The new __mod_memcg_lruvec_state() function will be used in the next
> patch as a replacement of mod_memcg_state() in mm/percpu.c for the
> consolidation of the memory uncharge and vmstat update functions in
> the kmem_cache_free() path.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/memcontrol.h | 12 +++++++-----
>  mm/memcontrol.c            | 19 +++++++++++++------
>  mm/slab.h                  |  2 +-
>  3 files changed, 21 insertions(+), 12 deletions(-)
> 
> [...]

Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa


* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-14  1:20 ` [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
@ 2021-04-15  3:27   ` Masayoshi Mizuma
  2021-04-15 16:30   ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h |  2 ++
>  mm/memcontrol.c            | 31 +++++++++++++++++++++++++++----
>  mm/percpu.c                |  9 ++-------
>  mm/slab.h                  |  6 +++---
>  4 files changed, 34 insertions(+), 14 deletions(-)
> 
> [...]

Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa


* Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-14  1:20 ` [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
@ 2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15 16:50   ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
>  mm/slab.h       | 26 +++++++----------
>  2 files changed, 79 insertions(+), 25 deletions(-)
> 
> [...]

Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa


* Re: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-14  1:20 ` [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
@ 2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15 16:57   ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:26PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page-based data stored there. Separate them out into
> their own struct to highlight their independence.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
>  1 file changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 539c3b632e47..69f728383efe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
>  }
>  EXPORT_SYMBOL(unlock_page_memcg);
>  
> -struct memcg_stock_pcp {
> -	struct mem_cgroup *cached; /* this never be root cgroup */
> -	unsigned int nr_pages;
> -
> +struct obj_stock {
>  #ifdef CONFIG_MEMCG_KMEM
>  	struct obj_cgroup *cached_objcg;
>  	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
>  	int vmstat_idx;
>  	int vmstat_bytes;
> +#else
> +	int dummy[0];
>  #endif
> +};
> +
> +struct memcg_stock_pcp {
> +	struct mem_cgroup *cached; /* this never be root cgroup */
> +	unsigned int nr_pages;
> +	struct obj_stock obj;
>  
>  	struct work_struct work;
>  	unsigned long flags;
> @@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>  static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct memcg_stock_pcp *stock);
> +static void drain_obj_stock(struct obj_stock *stock);
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  				     struct mem_cgroup *root_memcg);
>  
>  #else
> -static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
> +static inline void drain_obj_stock(struct obj_stock *stock)
>  {
>  }
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> @@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>  
> +static inline struct obj_stock *current_obj_stock(void)
> +{
> +	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> +
> +	return &stock->obj;
> +}
> +
>  /**
>   * consume_stock: Try to consume stocked charge on this cpu.
>   * @memcg: memcg to consume from.
> @@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
>  	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(stock);
> +	drain_obj_stock(&stock->obj);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>  
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
> -	struct memcg_stock_pcp *stock;
> +	struct obj_stock *stock;
>  	unsigned long flags;
>  	bool ret = false;
>  
>  	local_irq_save(flags);
>  
> -	stock = this_cpu_ptr(&memcg_stock);
> +	stock = current_obj_stock();
>  	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
>  		stock->nr_bytes -= nr_bytes;
>  		ret = true;
> @@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  	return ret;
>  }
>  
> -static void drain_obj_stock(struct memcg_stock_pcp *stock)
> +static void drain_obj_stock(struct obj_stock *stock)
>  {
>  	struct obj_cgroup *old = stock->cached_objcg;
>  
> @@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>  	struct mem_cgroup *memcg;
>  
> -	if (stock->cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->cached_objcg);
> +	if (stock->obj.cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
>  		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>  			return true;
>  	}
> @@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  
>  static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
> -	struct memcg_stock_pcp *stock;
> +	struct obj_stock *stock = current_obj_stock();
>  
> -	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached_objcg != objcg) { /* reset if necessary */
>  		drain_obj_stock(stock);
>  		obj_cgroup_get(objcg);
> @@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  static void __mod_obj_stock_state(struct obj_cgroup *objcg,
>  				  struct pglist_data *pgdat, int idx, int nr)
>  {
> -	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> +	struct obj_stock *stock = current_obj_stock();
>  
>  	if (stock->cached_objcg != objcg) {
>  		/* Output the current data as is */
> -- 
> 2.18.1
> 
Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa


* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-14  1:20 ` [PATCH v3 5/5] mm/memcg: Optimize user context object stock access Waiman Long
@ 2021-04-15  3:28   ` Masayoshi Mizuma
  2021-04-15  9:44     ` Christoph Lameter
  2021-04-15 17:53   ` Johannes Weiner
  1 sibling, 1 reply; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15  3:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured number of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
> 
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
> 
> The task context object stock can be accessed after disabling preemption,
> which is cheap in a non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupts. User context code can
> access the interrupt object stock, but not vice versa.
> 
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupts disabled.
> 
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 59 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 69f728383efe..8875e896e52b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>  	struct mem_cgroup *cached; /* this never be root cgroup */
>  	unsigned int nr_pages;
> -	struct obj_stock obj;
> +	struct obj_stock task_obj;
> +	struct obj_stock irq_obj;
>  
>  	struct work_struct work;
>  	unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>  
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>  	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>  
> -	return &stock->obj;
> +	return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags)				\
> +({							\
> +	struct memcg_stock_pcp *stock;			\
> +	struct obj_stock *obj_stock;			\
> +							\
> +	if (in_task()) {				\
> +		preempt_disable();			\
> +		(flags) = -1L;				\
> +		stock = this_cpu_ptr(&memcg_stock);	\
> +		obj_stock = &stock->task_obj;		\
> +	} else {					\
> +		local_irq_save(flags);			\
> +		stock = this_cpu_ptr(&memcg_stock);	\
> +		obj_stock = &stock->irq_obj;		\
> +	}						\
> +	obj_stock;					\
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +	if (flags == -1L)
> +		preempt_enable();
> +	else
> +		local_irq_restore(flags);
>  }
>  
>  /**
> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>  	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->obj);
> +	drain_obj_stock(&stock->irq_obj);
> +	if (in_task())
> +		drain_obj_stock(&stock->task_obj);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>  	memcg = obj_cgroup_memcg(objcg);
>  	if (pgdat)
>  		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>  	rcu_read_unlock();
>  }
>  
> @@ -3193,15 +3233,14 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  	unsigned long flags;
>  	bool ret = false;
>  
> -	local_irq_save(flags);
> +	stock = get_obj_stock(flags);
>  
> -	stock = current_obj_stock();
>  	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
>  		stock->nr_bytes -= nr_bytes;
>  		ret = true;
>  	}
>  
> -	local_irq_restore(flags);
> +	put_obj_stock(flags);
>  
>  	return ret;
>  }
> @@ -3254,8 +3293,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>  	struct mem_cgroup *memcg;
>  
> -	if (stock->obj.cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
> +	if (in_task() && stock->task_obj.cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> +		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +			return true;
> +	}
> +	if (stock->irq_obj.cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
>  		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>  			return true;
>  	}
> @@ -3283,9 +3327,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
>  	unsigned long flags;
>  
> -	local_irq_save(flags);
> +	get_obj_stock(flags);
>  	__refill_obj_stock(objcg, nr_bytes);
> -	local_irq_restore(flags);
> +	put_obj_stock(flags);
>  }
>  
>  static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> @@ -3325,9 +3369,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  {
>  	unsigned long flags;
>  
> -	local_irq_save(flags);
> +	get_obj_stock(flags);
>  	__mod_obj_stock_state(objcg, pgdat, idx, nr);
> -	local_irq_restore(flags);
> +	put_obj_stock(flags);
>  }
>  
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3380,10 +3424,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>  {
>  	unsigned long flags;
>  
> -	local_irq_save(flags);
> +	get_obj_stock(flags);
>  	__refill_obj_stock(objcg, size);
>  	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
> -	local_irq_restore(flags);
> +	put_obj_stock(flags);
>  }
>  
>  #endif /* CONFIG_MEMCG_KMEM */
> -- 
> 2.18.1
> 

Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks!
Masa


* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-15  3:28   ` Masayoshi Mizuma
@ 2021-04-15  9:44     ` Christoph Lameter
  2021-04-15 12:16       ` Masayoshi Mizuma
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Lameter @ 2021-04-15  9:44 UTC (permalink / raw)
  To: Masayoshi Mizuma
  Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Tejun Heo, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Wed, 14 Apr 2021, Masayoshi Mizuma wrote:

> Please feel free to add:
>
> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>
> Thanks!
> Masa
>

Would you please stop quoting the whole patch when you have nothing to say
about the details? It is enough to just respond without quoting. I was
looking through this trying to find something you said about individual
sections of code but there was nothing.





* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-15  9:44     ` Christoph Lameter
@ 2021-04-15 12:16       ` Masayoshi Mizuma
  0 siblings, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15 12:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Tejun Heo, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Thu, Apr 15, 2021 at 11:44:55AM +0200, Christoph Lameter wrote:
> Would you please stop quoting the whole patch when you have nothing to say
> about the details? It is enough to just respond without quoting. I was
> looking through this trying to find something you said about individual
> sections of code but there was nothing.

Thank you for pointing it out and sorry about that.
I'll do that next time.

- Masa


* Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-15  3:26 ` [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Masayoshi Mizuma
@ 2021-04-15 13:17   ` Waiman Long
  2021-04-15 15:47     ` Masayoshi Mizuma
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 13:17 UTC (permalink / raw)
  To: Masayoshi Mizuma
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On 4/14/21 11:26 PM, Masayoshi Mizuma wrote:
>
> Hi Longman,
>
> Thank you for your patches.
> I rerun the benchmark with your patches, it seems that the reduction
> is small... The total duration of sendto() and recvfrom() system call
> during the benchmark are as follows.
>
> - sendto
>    - v5.8 vanilla:                      2576.056 msec (100%)
>    - v5.12-rc7 vanilla:                 2988.911 msec (116%)
>    - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)
>
> - recvfrom
>    - v5.8 vanilla:                      2113.156 msec (100%)
>    - v5.12-rc7 vanilla:                 2305.810 msec (109%)
>    - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)
>
> kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times during
> the benchmark. I ran a loop in a kernel module as follows. The duration
> is indeed reduced by your patches.
>
>    ---
>    dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
>    for (i = 0; i < 1400000; i++) {
> 	p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
> 	kmem_cache_free(dummy_cache, p);
>    }
>    ---
>
> - v5.12-rc7 vanilla:                 110 msec (100%)
> - v5.12-rc7 with your patches (1-5):  85 msec (77%)
>
> It seems that the reduction is small for the benchmark though...
> Anyway, I can see your patches reduce the overhead.
> Please feel free to add:
>
> 	Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
>
> Thanks!
> Masa
>
Thanks for the testing.

I was focusing on your kernel module benchmark in testing my patch. I 
will try out your pgbench benchmark to see if there is other tuning that 
can be done.

BTW, how many numa nodes does your test machine have? I did my testing 
with a 2-socket system. The vmstat caching part may be less effective on 
systems with more numa nodes. I will try to find a larger 4-socket 
system for testing.

Cheers,
Longman



* Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-15 13:17   ` Waiman Long
@ 2021-04-15 15:47     ` Masayoshi Mizuma
  0 siblings, 0 replies; 36+ messages in thread
From: Masayoshi Mizuma @ 2021-04-15 15:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Xing Zhengjun

On Thu, Apr 15, 2021 at 09:17:37AM -0400, Waiman Long wrote:
> I was focusing on your kernel module benchmark in testing my patch. I will
> try out your pgbench benchmark to see if there is other tuning that can
> be done.

Thanks a lot!

> BTW, how many numa nodes does your test machine have? I did my testing with
> a 2-socket system. The vmstat caching part may be less effective on systems
> with more numa nodes. I will try to find a larger 4-socket system for
> testing.

The test machine has one node.

- Masa


* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-14  1:20 ` [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
  2021-04-15  3:27   ` Masayoshi Mizuma
@ 2021-04-15 16:30   ` Johannes Weiner
  2021-04-15 16:35     ` Waiman Long
  1 sibling, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 16:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function calls goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  	refill_obj_stock(objcg, size);
>  }
>  
> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> +				   struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.
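
E.g. an untested sketch, using the names from this series:

	/* irqs must be disabled by the caller */
	static void __obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
	{
		__refill_obj_stock(objcg, size);
	}

and a callsite that knows irqs are enabled can then do:

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	local_irq_disable();
	__obj_cgroup_uncharge(objcg, size);
	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
	local_irq_enable();
	rcu_read_unlock();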


* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 16:30   ` Johannes Weiner
@ 2021-04-15 16:35     ` Waiman Long
  2021-04-15 18:10       ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 16:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 12:30 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>> function calls goes through a separate irq_save/irq_restore cycle. That
>> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
>> that combines them with a single irq_save/irq_restore cycle.
>>
>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>   	refill_obj_stock(objcg, size);
>>   }
>>   
>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>> +				   struct pglist_data *pgdat, int idx)
> The optimization makes sense.
>
> But please don't combine independent operations like this into a
> single function. It makes for an unclear parameter list, it's a pain
> in the behind to change the constituent operations later on, and it
> has a habit of attracting more random bools over time. E.g. what if
> the caller already has irqs disabled? What if it KNOWS that irqs are
> enabled and it could use local_irq_disable() instead of save?
>
> Just provide an __obj_cgroup_uncharge() that assumes irqs are
> disabled, combine with the existing __mod_memcg_lruvec_state(), and
> bubble the irq handling up to those callsites which know better.
>
That will also work. However, I did that because of patch 5 in the 
series. I could have put the get_obj_stock() and put_obj_stock() code in 
slab.h and allowed them to be used directly in various places, but hiding 
them in one function is easier.

Anyway, I can change the patch if you think that is the right way.

Cheers,
Longman



* Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
  2021-04-15  3:27   ` Masayoshi Mizuma
@ 2021-04-15 16:40   ` Johannes Weiner
  2021-04-15 16:59     ` Waiman Long
  1 sibling, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.
> 
> The new __mod_memcg_lruvec_state() function will be used in the next
> patch as a replacement of mod_memcg_state() in mm/percpu.c for the
> consolidation of the memory uncharge and vmstat update functions in
> the kmem_cache_free() path.

This requires users who want both to pass a pgdat that can already be
derived from the lruvec (lruvec_pgdat() provides it). This is error prone,
and we just acked a patch that removes this very thing from
mem_cgroup_page_lruvec().

With the suggestion for patch 2, this shouldn't be necessary anymore,
though. And it sort of underlines my point about that combined function
creating awkward code above and below it.


* Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-14  1:20 ` [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
@ 2021-04-15 16:50   ` Johannes Weiner
  2021-04-15 17:08     ` Waiman Long
  1 sibling, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 16:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache the byte charge for a specific
> obj_cgroup to reduce that overhead. To further reduce it, this patch
> caches the vmstat data in the memcg_stock_pcp structure as well,
> until it accumulates a page size worth of updates or other cached
> data change.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.

Right, but mod_objcg_state() is itself already percpu-cached. What's
the benefit of avoiding calls to it with another percpu cache?


* Re: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-14  1:20 ` [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
@ 2021-04-15 16:57   ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 16:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:26PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page-based data stored there. Separate them out into
> their own struct to highlight their independence.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
>  1 file changed, 26 insertions(+), 15 deletions(-)

It's almost twice as much code, and IMO not any clearer. Plus it adds
some warts: int dummy[0], redundant naming in stock->obj.cached_objcg,
this_cpu_ptr() doesn't really need a wrapper, etc.


* Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-15 16:40   ` Johannes Weiner
@ 2021-04-15 16:59     ` Waiman Long
  2021-04-16 15:48       ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 16:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 12:40 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
>> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
>> available. So both of them are now passed to mod_memcg_lruvec_state()
>> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
>> updated to allow either of the two parameters to be set to null. This
>> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
>> is null.
>>
>> The new __mod_memcg_lruvec_state() function will be used in the next
>> patch as a replacement of mod_memcg_state() in mm/percpu.c for the
>> consolidation of the memory uncharge and vmstat update functions in
>> the kmem_cache_free() path.
> This requires users who want both to pass a pgdat that can already be
> derived from the lruvec (lruvec_pgdat() provides it). This is error prone,
> and we just acked a patch that removes this very thing from
> mem_cgroup_page_lruvec().
>
> With the suggestion for patch 2, this shouldn't be necessary anymore,
> though. And it sort of underlines my point about that combined function
> creating awkward code above and below it.
>
The reason for passing in the pgdat is the caching of vmstat data. The 
lruvec may be gone if the corresponding memory cgroup is removed, but the 
pgdat stays put. That is why I put the pgdat in the obj_stock for 
caching. I could also store the node id instead of the pgdat.
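
E.g. (sketch only, with a hypothetical cached_nid field):

	stock->cached_nid = pgdat->node_id;
	...
	/* and convert it back when draining the stock */
	mod_objcg_state(old, NODE_DATA(stock->cached_nid),
			stock->vmstat_idx, stock->vmstat_bytes);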

Cheers,
Longman



* Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-15 16:50   ` Johannes Weiner
@ 2021-04-15 17:08     ` Waiman Long
  2021-04-15 18:13       ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 17:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 12:50 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per object byte charging,
>> charging and vmstat data update happen only when new slab pages are
>> allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache the byte charge for a specific
>> obj_cgroup to reduce that overhead. To further reduce it, this patch
>> caches the vmstat data in the memcg_stock_pcp structure as well,
>> until it accumulates a page size worth of updates or other cached
>> data change.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, it was found that about 17% (946796 out of 5515184) of the
>> time when __mod_obj_stock_state() is called leads to an actual call to
>> mod_objcg_state() after initial boot. When doing parallel kernel build,
>> the figure was about 16% (21894614 out of 139780628). So caching the
>> vmstat data reduces the number of calls to mod_objcg_state() by more
>> than 80%.
> Right, but mod_objcg_state() is itself already percpu-cached. What's
> the benefit of avoiding calls to it with another percpu cache?
>
There are actually 2 sets of vmstat data that have to be updated. One is 
associated with the memcg and the other one is with each lruvec within the 
cgroup. By caching them in the obj_stock, we replace two writes to two 
colder cachelines with one write to a hot cacheline. If you look at patch 
5, I break the obj_stock into two - one for task context and one for irq 
context. Disabling interrupts is no longer needed in task context, but 
that is not possible when writing to the actual vmstat data arrays.
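
Schematically, for the common case:

	/* uncached: mod_objcg_state() hits two colder percpu arrays */
	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);

	/* cached: a single write to the already hot percpu obj_stock */
	stock->vmstat_bytes += nr;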

Cheers,
Longman



* Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (5 preceding siblings ...)
  2021-04-15  3:26 ` [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Masayoshi Mizuma
@ 2021-04-15 17:10 ` Matthew Wilcox
  2021-04-15 17:41   ` Waiman Long
  6 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2021-04-15 17:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> With memory accounting disable, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
> 
>   Applied patches   Run time   Accounting overhead   Overhead %age
>   ---------------   --------   -------------------   -------------
>        None          10.800s         7.952s              100.0%
>         1-2           9.140s         6.292s               79.1%
>         1-3           7.641s         4.793s               60.3%
>         1-5           6.801s         3.953s               49.7%

I think this is a misleading way to report the overhead.  I would have said:

			10.800s		7.952s		279.2%
			 9.140s		6.292s		220.9%
			 7.641s		4.793s		168.3%
			 6.801s		3.953s		138.8%



* Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-15 17:10 ` Matthew Wilcox
@ 2021-04-15 17:41   ` Waiman Long
  0 siblings, 0 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-15 17:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, linux-kernel,
	cgroups, linux-mm, Shakeel Butt, Muchun Song, Alex Shi,
	Chris Down, Yafang Shao, Wei Yang, Masayoshi Mizuma,
	Xing Zhengjun

On 4/15/21 1:10 PM, Matthew Wilcox wrote:
> On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
>> With memory accounting disable, the run time was 2.848s. With memory
>> accounting enabled, the run times with the application of various
>> patches in the patchset were:
>>
>>    Applied patches   Run time   Accounting overhead   Overhead %age
>>    ---------------   --------   -------------------   -------------
>>         None          10.800s         7.952s              100.0%
>>          1-2           9.140s         6.292s               79.1%
>>          1-3           7.641s         4.793s               60.3%
>>          1-5           6.801s         3.953s               49.7%
> I think this is a misleading way to report the overhead.  I would have said:
>
> 			10.800s		7.952s		279.2%
> 			 9.140s		6.292s		220.9%
> 			 7.641s		4.793s		168.3%
> 			 6.801s		3.953s		138.8%
>
What I want to emphasize is the reduction in the accounting overhead 
part of the execution time. Your percentages use the accounting-disabled 
run time as the denominator. I think both are valid; I will be clearer 
about that in the next version of the patch.
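
E.g. for the no-patch row:

	accounting overhead = 10.800s - 2.848s = 7.952s
	mine:  7.952s / 7.952s (baseline overhead)        = 100.0%
	yours: 7.952s / 2.848s (accounting-disabled time) = 279.2%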

Thanks,
Longman



* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-14  1:20 ` [PATCH v3 5/5] mm/memcg: Optimize user context object stock access Waiman Long
  2021-04-15  3:28   ` Masayoshi Mizuma
@ 2021-04-15 17:53   ` Johannes Weiner
  2021-04-15 18:16     ` Waiman Long
  1 sibling, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 17:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> > enabled, the measured number of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
> 
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
> 
> > The task context object stock can be accessed after disabling preemption,
> > which is cheap in a non-preempt kernel. The interrupt context object stock
> > can only be accessed after disabling interrupts. User context code can
> > access the interrupt object stock, but not vice versa.
> > 
> > The mod_objcg_state() function is also modified to make sure that memcg
> > and lruvec stat updates are done with interrupts disabled.
> 
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

This makes sense, and also explains the previous patch a bit
better. But please merge those two.

> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>  	struct mem_cgroup *cached; /* this never be root cgroup */
>  	unsigned int nr_pages;
> -	struct obj_stock obj;
> +	struct obj_stock task_obj;
> +	struct obj_stock irq_obj;
>  
>  	struct work_struct work;
>  	unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  }
>  #endif
>  
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>  	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>  
> -	return &stock->obj;
> +	return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags)				\
> +({							\
> +	struct memcg_stock_pcp *stock;			\
> +	struct obj_stock *obj_stock;			\
> +							\
> +	if (in_task()) {				\
> +		preempt_disable();			\
> +		(flags) = -1L;				\
> +		stock = this_cpu_ptr(&memcg_stock);	\
> +		obj_stock = &stock->task_obj;		\
> +	} else {					\
> +		local_irq_save(flags);			\
> +		stock = this_cpu_ptr(&memcg_stock);	\
> +		obj_stock = &stock->irq_obj;		\
> +	}						\
> +	obj_stock;					\
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +	if (flags == -1L)
> +		preempt_enable();
> +	else
> +		local_irq_restore(flags);
>  }

Please make them both functions and use 'unsigned long *flags'.

Also I'm not sure doing in_task() twice would actually be more
expensive than the == -1 special case, and it would be easier to understand.
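
I.e. something like this (untested):

	static inline struct obj_stock *get_obj_stock(unsigned long *flags)
	{
		struct memcg_stock_pcp *stock;

		if (in_task()) {
			preempt_disable();
			*flags = 0;	/* unused in this case */
			stock = this_cpu_ptr(&memcg_stock);
			return &stock->task_obj;
		}
		local_irq_save(*flags);
		stock = this_cpu_ptr(&memcg_stock);
		return &stock->irq_obj;
	}

	static inline void put_obj_stock(unsigned long flags)
	{
		if (in_task())
			preempt_enable();
		else
			local_irq_restore(flags);
	}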

> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>  	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->obj);
> +	drain_obj_stock(&stock->irq_obj);
> +	if (in_task())
> +		drain_obj_stock(&stock->task_obj);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>  	memcg = obj_cgroup_memcg(objcg);
>  	if (pgdat)
>  		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>  	rcu_read_unlock();

This is actually a bug introduced in the earlier patch, isn't it?
Calling __mod_memcg_lruvec_state() without irqs disabled...


* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 16:35     ` Waiman Long
@ 2021-04-15 18:10       ` Johannes Weiner
  2021-04-15 18:47         ` Waiman Long
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 18:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > function call goes through a separate irq_save/irq_restore cycle. That
> > > is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> > > that combines them with a single irq_save/irq_restore cycle.
> > > 
> > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > >   	refill_obj_stock(objcg, size);
> > >   }
> > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > +				   struct pglist_data *pgdat, int idx)
> > The optimization makes sense.
> > 
> > But please don't combine independent operations like this into a
> > single function. It makes for an unclear parameter list, it's a pain
> > in the behind to change the constituent operations later on, and it
> > has a habit of attracting more random bools over time. E.g. what if
> > the caller already has irqs disabled? What if it KNOWS that irqs are
> > enabled and it could use local_irq_disable() instead of save?
> > 
> > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > bubble the irq handling up to those callsites which know better.
> > 
> That will also work. However, I did that because of patch 5 in the
> series. I could have put the get_obj_stock() and put_obj_stock() code in
> slab.h and allowed them to be used directly in various places, but hiding
> them in one function is easier.

Yeah it's more obvious after getting to patch 5.

But with the irq disabling gone entirely, is there still an incentive
to combine the atomic section at all? Disabling preemption is pretty
cheap, so it wouldn't matter to just do it twice.

I.e. couldn't the final sequence in slab code simply be

	objcg_uncharge()
	mod_objcg_state()

again and each function disables preemption (and in the rare case
irqs) as it sees fit?

You lose the irqsoff batching in the cold path, but as you say, hit
rates are pretty good, and it doesn't seem worth complicating the code
for the cold path.
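
I.e. the free hook would again just do something like this (sketch, with
mod_objcg_state() made available to slab code):

	obj_cgroup_uncharge(objcg, obj_full_size(s));
	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
			-(int)obj_full_size(s));

with each of the two doing its own get_obj_stock()/put_obj_stock()
internally.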


* Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-15 17:08     ` Waiman Long
@ 2021-04-15 18:13       ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 18:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 01:08:29PM -0400, Waiman Long wrote:
> On 4/15/21 12:50 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> > > Before the new slab memory controller with per object byte charging,
> > > charging and vmstat data update happen only when new slab pages are
> > > allocated or freed. Now they are done with every kmem_cache_alloc()
> > > and kmem_cache_free(). This causes additional overhead for workloads
> > > that generate a lot of alloc and free calls.
> > > 
> > > The memcg_stock_pcp is used to cache the byte charge for a specific
> > > obj_cgroup to reduce that overhead. To further reduce it, this patch
> > > caches the vmstat data in the memcg_stock_pcp structure as well,
> > > until it accumulates a page size worth of updates or other cached
> > > data change.
> > > 
> > > On a 2-socket Cascade Lake server with instrumentation enabled and this
> > > patch applied, it was found that about 17% (946796 out of 5515184) of the
> > > time when __mod_obj_stock_state() is called leads to an actual call to
> > > mod_objcg_state() after initial boot. When doing parallel kernel build,
> > > the figure was about 16% (21894614 out of 139780628). So caching the
> > > vmstat data reduces the number of calls to mod_objcg_state() by more
> > > than 80%.
> > Right, but mod_objcg_state() is itself already percpu-cached. What's
> > the benefit of avoiding calls to it with another percpu cache?
> > 
> There are actually 2 sets of vmstat data that have to be updated. One is
> associated with the memcg and the other one is with each lruvec within the
> cgroup. By caching them in the obj_stock, we replace two writes to two colder
> cachelines with one write to a hot cacheline. If you look at patch 5, I
> break the obj_stock into two - one for task context and one for irq context.
> Disabling interrupts is no longer needed in task context, but that is not
> possible when writing to the actual vmstat data arrays.

Ah, thanks for the explanation. Both of these points are worth
mentioning in the changelog of this patch.


* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-15 17:53   ` Johannes Weiner
@ 2021-04-15 18:16     ` Waiman Long
  2021-04-15 18:53       ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 18:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 1:53 PM, Johannes Weiner wrote:
> On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
>> Most kmem_cache_alloc() calls are from user context. With instrumentation
>> enabled, the measured amount of kmem_cache_alloc() calls from non-task
>> context was about 0.01% of the total.
>>
>> The irq disable/enable sequence used in this case to access content
>> from object stock is slow.  To optimize for user context access, there
>> are now two object stocks for task context and interrupt context access
>> respectively.
>>
>> The task context object stock can be accessed after disabling preemption,
>> which is cheap in a non-preempt kernel. The interrupt context object stock
>> can only be accessed after disabling interrupts. User context code can
>> access the interrupt object stock, but not vice versa.
>>
>> The mod_objcg_state() function is also modified to make sure that memcg
>> and lruvec stat updates are done with interrupts disabled.
>>
>> The downside of this change is that there are more data stored in local
>> object stocks and not reflected in the charge counter and the vmstat
>> arrays.  However, this is a small price to pay for better performance.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: Roman Gushchin <guro@fb.com>
>> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> This makes sense, and also explains the previous patch a bit
> better. But please merge those two.
The reason I broke it into two is so that the patches are individually 
easier to review. I prefer to update the commit log of patch 4 to 
explain why the obj_stock structure is introduced instead of merging the 
two.
>
>> @@ -2229,7 +2229,8 @@ struct obj_stock {
>>   struct memcg_stock_pcp {
>>   	struct mem_cgroup *cached; /* this never be root cgroup */
>>   	unsigned int nr_pages;
>> -	struct obj_stock obj;
>> +	struct obj_stock task_obj;
>> +	struct obj_stock irq_obj;
>>   
>>   	struct work_struct work;
>>   	unsigned long flags;
>> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>   }
>>   #endif
>>   
>> +/*
>> + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
>> + * sequence used in this case to access content from object stock is slow.
>> + * To optimize for user context access, there are now two object stocks for
>> + * task context and interrupt context access respectively.
>> + *
>> + * The task context object stock can be accessed by disabling preemption only
>> + * which is cheap in non-preempt kernel. The interrupt context object stock
>> + * can only be accessed after disabling interrupt. User context code can
>> + * access interrupt object stock, but not vice versa.
>> + */
>>   static inline struct obj_stock *current_obj_stock(void)
>>   {
>>   	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>>   
>> -	return &stock->obj;
>> +	return in_task() ? &stock->task_obj : &stock->irq_obj;
>> +}
>> +
>> +#define get_obj_stock(flags)				\
>> +({							\
>> +	struct memcg_stock_pcp *stock;			\
>> +	struct obj_stock *obj_stock;			\
>> +							\
>> +	if (in_task()) {				\
>> +		preempt_disable();			\
>> +		(flags) = -1L;				\
>> +		stock = this_cpu_ptr(&memcg_stock);	\
>> +		obj_stock = &stock->task_obj;		\
>> +	} else {					\
>> +		local_irq_save(flags);			\
>> +		stock = this_cpu_ptr(&memcg_stock);	\
>> +		obj_stock = &stock->irq_obj;		\
>> +	}						\
>> +	obj_stock;					\
>> +})
>> +
>> +static inline void put_obj_stock(unsigned long flags)
>> +{
>> +	if (flags == -1L)
>> +		preempt_enable();
>> +	else
>> +		local_irq_restore(flags);
>>   }
> Please make them both functions and use 'unsigned long *flags'.
Sure, I can do that.
>
> Also I'm not sure doing in_task() twice would actually be more
> expensive than the == -1 special case, and easier to understand.
I can make that change too. Either way is fine with me.
>
>> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>>   	local_irq_save(flags);
>>   
>>   	stock = this_cpu_ptr(&memcg_stock);
>> -	drain_obj_stock(&stock->obj);
>> +	drain_obj_stock(&stock->irq_obj);
>> +	if (in_task())
>> +		drain_obj_stock(&stock->task_obj);
>>   	drain_stock(stock);
>>   	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>>   
>> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>>   	memcg = obj_cgroup_memcg(objcg);
>>   	if (pgdat)
>>   		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> -	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>> +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>>   	rcu_read_unlock();
> This is actually a bug introduced in the earlier patch, isn't it?
> Calling __mod_memcg_lruvec_state() without irqs disabled...
>
Not really. In patch 3, mod_objcg_state() is called only in the stock 
update context where interrupts had already been disabled. But now that 
is no longer the case, which is why I need to update mod_objcg_state() to 
make sure irqs are disabled before updating the vmstat data arrays.

Cheers,
Longman



* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 18:10       ` Johannes Weiner
@ 2021-04-15 18:47         ` Waiman Long
  2021-04-15 19:40           ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 18:47 UTC (permalink / raw)
  To: Johannes Weiner, Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 2:10 PM, Johannes Weiner wrote:
> On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
>> On 4/15/21 12:30 PM, Johannes Weiner wrote:
>>> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>>>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>>>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>>>> function calls goes through a separate irq_save/irq_restore cycle. That
>>>> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
>>>> that combines them with a single irq_save/irq_restore cycle.
>>>>
>>>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>>>    	refill_obj_stock(objcg, size);
>>>>    }
>>>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>>>> +				   struct pglist_data *pgdat, int idx)
>>> The optimization makes sense.
>>>
>>> But please don't combine independent operations like this into a
>>> single function. It makes for an unclear parameter list, it's a pain
>>> in the behind to change the constituent operations later on, and it
>>> has a habit of attracting more random bools over time. E.g. what if
>>> the caller already has irqs disabled? What if it KNOWS that irqs are
>>> enabled and it could use local_irq_disable() instead of save?
>>>
>>> Just provide an __obj_cgroup_uncharge() that assumes irqs are
>>> disabled, combine with the existing __mod_memcg_lruvec_state(), and
>>> bubble the irq handling up to those callsites which know better.
>>>
>> That will also work. However, the reason I did that was because of patch 5
>> in the series. I could put the get_obj_stock() and put_obj_stock() code in
>> slab.h and allow them to be used directly in various places, but hiding
>> them in one function is easier.
> Yeah it's more obvious after getting to patch 5.
>
> But with the irq disabling gone entirely, is there still an incentive
> to combine the atomic section at all? Disabling preemption is pretty
> cheap, so it wouldn't matter to just do it twice.
>
> I.e. couldn't the final sequence in slab code simply be
>
> 	objcg_uncharge()
> 	mod_objcg_state()
>
> again and each function disables preemption (and in the rare case
> irqs) as it sees fit?
>
> You lose the irqsoff batching in the cold path, but as you say, hit
> rates are pretty good, and it doesn't seem worth complicating the code
> for the cold path.
>
That does make sense, though a little bit of performance may be lost. I
will try that out to see how it works out performance-wise.
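
The free hook would then boil down to something like this (a rough
sketch; obj_full_size() and cache_vmstat_idx() are the existing
mm/slab.h helpers, and s, page and objcg come from the surrounding
hook):

	obj_cgroup_uncharge(objcg, obj_full_size(s));
	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
			-obj_full_size(s));

with each call taking and releasing the percpu stock on its own.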

Cheers,
Longman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-15 18:16     ` Waiman Long
@ 2021-04-15 18:53       ` Johannes Weiner
  2021-04-15 19:06         ` Waiman Long
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 18:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 02:16:17PM -0400, Waiman Long wrote:
> On 4/15/21 1:53 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
> > > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > > enabled, the measured number of kmem_cache_alloc() calls from non-task
> > > context was about 0.01% of the total.
> > > 
> > > The irq disable/enable sequence used in this case to access content
> > > from object stock is slow.  To optimize for user context access, there
> > > are now two object stocks for task context and interrupt context access
> > > respectively.
> > > 
> > > The task context object stock can be accessed after disabling preemption,
> > > which is cheap in a non-preempt kernel. The interrupt context object stock
> > > can only be accessed after disabling interrupts. User context code can
> > > access the interrupt object stock, but not vice versa.
> > > 
> > > The mod_objcg_state() function is also modified to make sure that memcg
> > > and lruvec stat updates are done with interrupts disabled.
> > > 
> > > The downside of this change is that more data is stored in the local
> > > object stocks and not reflected in the charge counter and the vmstat
> > > arrays.  However, this is a small price to pay for better performance.
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Acked-by: Roman Gushchin <guro@fb.com>
> > > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > This makes sense, and also explains the previous patch a bit
> > better. But please merge those two.
> The reason I broke it into two is so that the patches are individually
> easier to review. I prefer to update the commit log of patch 4 to explain
> why the obj_stock structure is introduced instead of merging the two.

Well, I did not find them easier to review separately.

> > > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
> > >   	local_irq_save(flags);
> > >   	stock = this_cpu_ptr(&memcg_stock);
> > > -	drain_obj_stock(&stock->obj);
> > > +	drain_obj_stock(&stock->irq_obj);
> > > +	if (in_task())
> > > +		drain_obj_stock(&stock->task_obj);
> > >   	drain_stock(stock);
> > >   	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> > > @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
> > >   	memcg = obj_cgroup_memcg(objcg);
> > >   	if (pgdat)
> > >   		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > -	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > > +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > >   	rcu_read_unlock();
> > This is actually a bug introduced in the earlier patch, isn't it?
> > Calling __mod_memcg_lruvec_state() without irqs disabled...
> > 
> Not really. In patch 3, mod_objcg_state() is called only in the stock
> update context where interrupts had already been disabled. But now that
> is no longer the case, which is why I need to update mod_objcg_state()
> to make sure irqs are disabled before updating the vmstat data array.

Oh, I see it now. Man, that's subtle. We've had several very
hard-to-track-down preemption bugs in those stats, because they manifest
as counter imbalances and you have no idea if there is a leak somewhere.

The convention for these functions is that the __ prefix indicates
that preemption has been suitably disabled. Please always follow this
convention, even if the semantic change is temporary.
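
I.e. the expected contract, sketched:

	/* caller has already disabled irqs/preemption */
	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);

	/* safe from any context; disables irqs internally */
	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);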

Btw, is there a reason why the stock caching isn't just part of
mod_objcg_state()? Why does the user need to choose if they want the
caching or not? It's not like we ask for this when charging, either.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
  2021-04-15 18:53       ` Johannes Weiner
@ 2021-04-15 19:06         ` Waiman Long
  0 siblings, 0 replies; 36+ messages in thread
From: Waiman Long @ 2021-04-15 19:06 UTC (permalink / raw)
  To: Johannes Weiner, Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 2:53 PM, Johannes Weiner wrote:
> On Thu, Apr 15, 2021 at 02:16:17PM -0400, Waiman Long wrote:
>> On 4/15/21 1:53 PM, Johannes Weiner wrote:
>>> On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
>>>> Most kmem_cache_alloc() calls are from user context. With instrumentation
>>>> enabled, the measured number of kmem_cache_alloc() calls from non-task
>>>> context was about 0.01% of the total.
>>>>
>>>> The irq disable/enable sequence used in this case to access content
>>>> from object stock is slow.  To optimize for user context access, there
>>>> are now two object stocks for task context and interrupt context access
>>>> respectively.
>>>>
>>>> The task context object stock can be accessed after disabling preemption,
>>>> which is cheap in a non-preempt kernel. The interrupt context object stock
>>>> can only be accessed after disabling interrupts. User context code can
>>>> access the interrupt object stock, but not vice versa.
>>>>
>>>> The mod_objcg_state() function is also modified to make sure that memcg
>>>> and lruvec stat updates are done with interrupts disabled.
>>>>
>>>> The downside of this change is that more data is stored in the local
>>>> object stocks and not reflected in the charge counter and the vmstat
>>>> arrays.  However, this is a small price to pay for better performance.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> Acked-by: Roman Gushchin <guro@fb.com>
>>>> Reviewed-by: Shakeel Butt <shakeelb@google.com>
>>> This makes sense, and also explains the previous patch a bit
>>> better. But please merge those two.
>> The reason I broke it into two is so that the patches are individually
>> easier to review. I prefer to update the commit log of patch 4 to explain
>> why the obj_stock structure is introduced instead of merging the two.
> Well, I did not find them easier to review separately.
>
>>>> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>>>>    	local_irq_save(flags);
>>>>    	stock = this_cpu_ptr(&memcg_stock);
>>>> -	drain_obj_stock(&stock->obj);
>>>> +	drain_obj_stock(&stock->irq_obj);
>>>> +	if (in_task())
>>>> +		drain_obj_stock(&stock->task_obj);
>>>>    	drain_stock(stock);
>>>>    	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>>>> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>>>>    	memcg = obj_cgroup_memcg(objcg);
>>>>    	if (pgdat)
>>>>    		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>> -	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>>>> +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>>>>    	rcu_read_unlock();
>>> This is actually a bug introduced in the earlier patch, isn't it?
>>> Calling __mod_memcg_lruvec_state() without irqs disabled...
>>>
>> Not really. In patch 3, mod_objcg_state() is called only in the stock
>> update context where interrupts had already been disabled. But now that
>> is no longer the case, which is why I need to update mod_objcg_state()
>> to make sure irqs are disabled before updating the vmstat data array.
> Oh, I see it now. Man, that's subtle. We've had several very
> hard-to-track-down preemption bugs in those stats, because they manifest
> as counter imbalances and you have no idea if there is a leak somewhere.
>
> The convention for these functions is that the __ prefix indicates
> that preemption has been suitably disabled. Please always follow this
> convention, even if the semantic change is temporary.
I see. I will fix that in the next version.
>
> Btw, is there a reason why the stock caching isn't just part of
> mod_objcg_state()? Why does the user need to choose if they want the
> caching or not? It's not like we ask for this when charging, either.
>
Yes, I can revert it to make mod_objcg_state() call
mod_obj_stock_state() internally instead of the other way around.
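
I.e. roughly this shape (a sketch only; the mod_obj_stock_state()
signature here is just indicative of patch 3's helper):

void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
		     enum node_stat_item idx, int nr)
{
	unsigned long flags;
	struct obj_stock *stock = get_obj_stock(&flags);

	/* fold nr into the cached vmstat data; drain on objcg/pgdat change */
	mod_obj_stock_state(stock, objcg, pgdat, idx, nr);

	put_obj_stock(flags);
}

That way every caller gets the stock caching without having to opt in.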

Cheers,
Longman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 18:47         ` Waiman Long
@ 2021-04-15 19:40           ` Johannes Weiner
  2021-04-15 19:44             ` Waiman Long
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 19:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > function calls goes through a separate irq_save/irq_restore cycle. That
> > > > > is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > > 
> > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > > > >    	refill_obj_stock(objcg, size);
> > > > >    }
> > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > > > +				   struct pglist_data *pgdat, int idx)
> > > > The optimization makes sense.
> > > > 
> > > > But please don't combine independent operations like this into a
> > > > single function. It makes for an unclear parameter list, it's a pain
> > > > in the behind to change the constituent operations later on, and it
> > > > has a habit of attracting more random bools over time. E.g. what if
> > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > enabled and it could use local_irq_disable() instead of save?
> > > > 
> > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > bubble the irq handling up to those callsites which know better.
> > > > 
> > > That will also work. However, the reason I did that was because of patch 5
> > > in the series. I could put the get_obj_stock() and put_obj_stock() code in
> > > slab.h and allow them to be used directly in various places, but hiding
> > > them in one function is easier.
> > Yeah it's more obvious after getting to patch 5.
> > 
> > But with the irq disabling gone entirely, is there still an incentive
> > to combine the atomic section at all? Disabling preemption is pretty
> > cheap, so it wouldn't matter to just do it twice.
> > 
> > I.e. couldn't the final sequence in slab code simply be
> > 
> > 	objcg_uncharge()
> > 	mod_objcg_state()
> > 
> > again and each function disables preemption (and in the rare case
> > irqs) as it sees fit?
> > 
> > You lose the irqsoff batching in the cold path, but as you say, hit
> > rates are pretty good, and it doesn't seem worth complicating the code
> > for the cold path.
> > 
> That does make sense, though a little bit of performance may be lost. I
> will try that out to see how it works out performance-wise.

Thanks.

Even if we still end up doing it, it's great to have that cost
isolated, so we know how much extra code complexity corresponds to how
much performance gain. It seems the task/irq split could otherwise be
a pretty localized change with no API implications.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 19:40           ` Johannes Weiner
@ 2021-04-15 19:44             ` Waiman Long
  2021-04-15 20:19               ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2021-04-15 19:44 UTC (permalink / raw)
  To: Johannes Weiner, Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On 4/15/21 3:40 PM, Johannes Weiner wrote:
> On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
>> On 4/15/21 2:10 PM, Johannes Weiner wrote:
>>> On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
>>>> On 4/15/21 12:30 PM, Johannes Weiner wrote:
>>>>> On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
>>>>>> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
>>>>>> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
>>>>>> function calls goes through a separate irq_save/irq_restore cycle. That
>>>>>> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
>>>>>> that combines them with a single irq_save/irq_restore cycle.
>>>>>>
>>>>>> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>>>>>     	refill_obj_stock(objcg, size);
>>>>>>     }
>>>>>> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>>>>>> +				   struct pglist_data *pgdat, int idx)
>>>>> The optimization makes sense.
>>>>>
>>>>> But please don't combine independent operations like this into a
>>>>> single function. It makes for an unclear parameter list, it's a pain
>>>>> in the behind to change the constituent operations later on, and it
>>>>> has a habit of attracting more random bools over time. E.g. what if
>>>>> the caller already has irqs disabled? What if it KNOWS that irqs are
>>>>> enabled and it could use local_irq_disable() instead of save?
>>>>>
>>>>> Just provide an __obj_cgroup_uncharge() that assumes irqs are
>>>>> disabled, combine with the existing __mod_memcg_lruvec_state(), and
>>>>> bubble the irq handling up to those callsites which know better.
>>>>>
>>>> That will also work. However, the reason I did that was because of patch 5
>>>> in the series. I could put the get_obj_stock() and put_obj_stock() code in
>>>> slab.h and allow them to be used directly in various places, but hiding
>>>> them in one function is easier.
>>> Yeah it's more obvious after getting to patch 5.
>>>
>>> But with the irq disabling gone entirely, is there still an incentive
>>> to combine the atomic section at all? Disabling preemption is pretty
>>> cheap, so it wouldn't matter to just do it twice.
>>>
>>> I.e. couldn't the final sequence in slab code simply be
>>>
>>> 	objcg_uncharge()
>>> 	mod_objcg_state()
>>>
>>> again and each function disables preemption (and in the rare case
>>> irqs) as it sees fit?
>>>
>>> You lose the irqsoff batching in the cold path, but as you say, hit
>>> rates are pretty good, and it doesn't seem worth complicating the code
>>> for the cold path.
>>>
>> That does make sense, though a little bit of performance may be lost. I
>> will try that out to see how it works out performance-wise.
> Thanks.
>
> Even if we still end up doing it, it's great to have that cost
> isolated, so we know how much extra code complexity corresponds to how
> much performance gain. It seems the task/irq split could otherwise be
> a pretty localized change with no API implications.
>
I still want to move the mod_objcg_state() function to memcontrol.c,
though, as I don't want to put any obj_stock stuff in mm/slab.h.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-15 19:44             ` Waiman Long
@ 2021-04-15 20:19               ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2021-04-15 20:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 03:44:56PM -0400, Waiman Long wrote:
> On 4/15/21 3:40 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> > > On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > > > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > > > function calls goes through a separate irq_save/irq_restore cycle. That
> > > > > > > is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> > > > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > > > > 
> > > > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> > > > > > >     	refill_obj_stock(objcg, size);
> > > > > > >     }
> > > > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > > > > > +				   struct pglist_data *pgdat, int idx)
> > > > > > The optimization makes sense.
> > > > > > 
> > > > > > But please don't combine independent operations like this into a
> > > > > > single function. It makes for an unclear parameter list, it's a pain
> > > > > > in the behind to change the constituent operations later on, and it
> > > > > > has a habit of attracting more random bools over time. E.g. what if
> > > > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > > > enabled and it could use local_irq_disable() instead of save?
> > > > > > 
> > > > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > > > bubble the irq handling up to those callsites which know better.
> > > > > > 
> > > > > That will also work. However, the reason I did that was because of patch 5
> > > > > in the series. I could put the get_obj_stock() and put_obj_stock() code in
> > > > > slab.h and allow them to be used directly in various places, but hiding
> > > > > them in one function is easier.
> > > > Yeah it's more obvious after getting to patch 5.
> > > > 
> > > > But with the irq disabling gone entirely, is there still an incentive
> > > > to combine the atomic section at all? Disabling preemption is pretty
> > > > cheap, so it wouldn't matter to just do it twice.
> > > > 
> > > > I.e. couldn't the final sequence in slab code simply be
> > > > 
> > > > 	objcg_uncharge()
> > > > 	mod_objcg_state()
> > > > 
> > > > again and each function disables preemption (and in the rare case
> > > > irqs) as it sees fit?
> > > > 
> > > > You lose the irqsoff batching in the cold path, but as you say, hit
> > > > rates are pretty good, and it doesn't seem worth complicating the code
> > > > for the cold path.
> > > > 
> > > That does make sense, though a little bit of performance may be lost. I
> > > will try that out to see how it works out performance-wise.
> > Thanks.
> > 
> > Even if we still end up doing it, it's great to have that cost
> > isolated, so we know how much extra code complexity corresponds to how
> > much performance gain. It seems the task/irq split could otherwise be
> > a pretty localized change with no API implications.
> > 
> I still want to move the mod_objcg_state() function to memcontrol.c,
> though, as I don't want to put any obj_stock stuff in mm/slab.h.

No objection from me!

That's actually a nice cleanup, IMO. Not sure why it was separated
from the rest of the objcg interface implementation to begin with.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-15 16:59     ` Waiman Long
@ 2021-04-16 15:48       ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2021-04-16 15:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Tejun Heo,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Wei Yang, Masayoshi Mizuma, Xing Zhengjun

On Thu, Apr 15, 2021 at 12:59:21PM -0400, Waiman Long wrote:
> On 4/15/21 12:40 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
> > > The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> > > available. So both of them are now passed to mod_memcg_lruvec_state()
> > > and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> > > updated to allow either of the two parameters to be set to null. This
> > > makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> > > is null.
> > > 
> > > The new __mod_memcg_lruvec_state() function will be used in the next
> > > patch as a replacement of mod_memcg_state() in mm/percpu.c for the
> > > consolidation of the memory uncharge and vmstat update functions in
> > > the kmem_cache_free() path.
> > This requires users who want both to pass a pgdat that can be derived
> > from the lruvec. This is error-prone, and we just acked a patch that
> > removes this very thing from mem_cgroup_page_lruvec().
> > 
> > With the suggestion for patch 2, this shouldn't be necessary anymore,
> > though. And sort of underlines my point around that combined function
> > creating akwward code above and below it.
> > 
> The reason for passing in the pgdat is the caching of vmstat data. The
> lruvec may be gone if the corresponding memory cgroup is removed, but the
> pgdat will stay put. That is why I put the pgdat in the obj_stock for
> caching. I could also store the node id instead of the pgdat.

Internally storing the pgdat is fine.

I was talking about the interface requiring the user to pass redundant
information (pgdat, can be derived from lruvec) because the parameter
is overloaded to modify the function's behavior, which is unintuitive.
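
I.e. internally the node can always be derived (sketch):

	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

rather than making callers pass both.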

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-04-16 15:48 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14  1:20 [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
2021-04-14  1:20 ` [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
2021-04-15  3:27   ` Masayoshi Mizuma
2021-04-15 16:40   ` Johannes Weiner
2021-04-15 16:59     ` Waiman Long
2021-04-16 15:48       ` Johannes Weiner
2021-04-14  1:20 ` [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
2021-04-15  3:27   ` Masayoshi Mizuma
2021-04-15 16:30   ` Johannes Weiner
2021-04-15 16:35     ` Waiman Long
2021-04-15 18:10       ` Johannes Weiner
2021-04-15 18:47         ` Waiman Long
2021-04-15 19:40           ` Johannes Weiner
2021-04-15 19:44             ` Waiman Long
2021-04-15 20:19               ` Johannes Weiner
2021-04-14  1:20 ` [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
2021-04-15  3:28   ` Masayoshi Mizuma
2021-04-15 16:50   ` Johannes Weiner
2021-04-15 17:08     ` Waiman Long
2021-04-15 18:13       ` Johannes Weiner
2021-04-14  1:20 ` [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
2021-04-15  3:28   ` Masayoshi Mizuma
2021-04-15 16:57   ` Johannes Weiner
2021-04-14  1:20 ` [PATCH v3 5/5] mm/memcg: Optimize user context object stock access Waiman Long
2021-04-15  3:28   ` Masayoshi Mizuma
2021-04-15  9:44     ` Christoph Lameter
2021-04-15 12:16       ` Masayoshi Mizuma
2021-04-15 17:53   ` Johannes Weiner
2021-04-15 18:16     ` Waiman Long
2021-04-15 18:53       ` Johannes Weiner
2021-04-15 19:06         ` Waiman Long
2021-04-15  3:26 ` [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Masayoshi Mizuma
2021-04-15 13:17   ` Waiman Long
2021-04-15 15:47     ` Masayoshi Mizuma
2021-04-15 17:10 ` Matthew Wilcox
2021-04-15 17:41   ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).