* [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
@ 2021-04-09 23:18 Waiman Long
  2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and de-allocations
may therefore experience a performance regression, as illustrated in [1].

A simple kernel module performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
module init. The execution times to load the kernel module with and
without memory accounting were:

  with accounting = 6.798s
  w/o  accounting = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s, a 50% reduction from the original 5.04s.
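
A minimal sketch of such a benchmark module is shown below. This is not
the actual module used here or in [1]; the cache name, object size and
the use of SLAB_ACCOUNT as the accounting switch are illustrative only,
and it assumes insmod is run from inside a non-root memory cgroup so
that the allocations are actually accounted:

#include <linux/module.h>
#include <linux/slab.h>

static int __init kmem_bench_init(void)
{
	struct kmem_cache *cachep;
	void *obj;
	long i;

	/* SLAB_ACCOUNT enables memcg accounting; drop it for the "w/o" run */
	cachep = kmem_cache_create("bench-64", 64, 0, SLAB_ACCOUNT, NULL);
	if (!cachep)
		return -ENOMEM;

	/* repeated alloc/free loop measured via the module load time */
	for (i = 0; i < 100000000L; i++) {
		obj = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cachep, obj);
	}

	kmem_cache_destroy(cachep);
	return 0;
}
module_init(kmem_bench_init);
MODULE_LICENSE("GPL");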

It was found that a major part of the memory accounting overhead is
caused by the local_irq_save()/local_irq_restore() sequences used when
updating the local stock charge bytes and the vmstat array, at least
on x86 systems. There are two such sequences in kmem_cache_alloc() and
two in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible; in fact, it eliminates them in the
common case. Another part of this patchset caches the vmstat data
updates in the local stock as well, which also helps.
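
As a rough sketch of the hot-path change (condensed from
refill_obj_stock() and the get_obj_stock()/put_obj_stock() helpers
introduced in patch 5; declarations omitted, not the exact final code):

	/* before: every object stock access pays for an irq off/on cycle */
	local_irq_save(flags);
	stock = this_cpu_ptr(&memcg_stock);
	/* ... update cached charge bytes / vmstat data ... */
	local_irq_restore(flags);

	/* after: the common task-context case only disables preemption */
	if (in_task()) {
		preempt_disable();
		obj_stock = &this_cpu_ptr(&memcg_stock)->task_obj;
		/* ... update cached charge bytes / vmstat data ... */
		preempt_enable();
	} else {
		local_irq_save(flags);
		obj_stock = &this_cpu_ptr(&memcg_stock)->irq_obj;
		/* ... update cached charge bytes / vmstat data ... */
		local_irq_restore(flags);
	}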

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c            | 198 ++++++++++++++++++++++++++++++++-----
 mm/percpu.c                |   9 +-
 mm/slab.h                  |  32 +++---
 4 files changed, 195 insertions(+), 58 deletions(-)

-- 
2.18.1


* [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
@ 2021-04-09 23:18 ` Waiman Long
  2021-04-12 18:04   ` Roman Gushchin
  2021-04-12 19:22   ` Shakeel Butt
  2021-04-09 23:18 ` [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). __mod_memcg_lruvec_state() is updated
to allow either of the two parameters to be NULL. This makes
mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is
NULL, which a later patch in this series uses to update a memcg-only
statistic (MEMCG_PERCPU_B) that has no corresponding lruvec counter.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h | 12 +++++++-----
 mm/memcontrol.c            | 19 +++++++++++++------
 mm/slab.h                  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					  struct lruvec *lruvec,
 					  enum node_stat_item idx, int val)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__mod_memcg_lruvec_state(lruvec, idx, val);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, val);
 	local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+					    struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			      enum node_stat_item idx, int val)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
+	/* Update lruvec */
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+
+	if (!memcg)
+		memcg = pn->memcg;
 
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
-	/* Update lruvec */
+	if (!lruvec)
+		return;
+
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
 	if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
-		__mod_memcg_lruvec_state(lruvec, idx, val);
+		__mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
-- 
2.18.1


* [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
  2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
@ 2021-04-09 23:18 ` Waiman Long
  2021-04-12 15:50   ` Shakeel Butt
  2021-04-12 18:10   ` Roman Gushchin
  2021-04-09 23:18 ` [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle. That
is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them into a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 31 +++++++++++++++++++++++++++----
 mm/percpu.c                |  9 ++-------
 mm/slab.h                  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;
 
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..24ecd51c688c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
2.18.1


* [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
  2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
  2021-04-09 23:18 ` [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
@ 2021-04-09 23:18 ` Waiman Long
  2021-04-12 18:22   ` Roman Gushchin
  2021-04-09 23:18 ` [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

Before the new slab memory controller with per-object byte charging,
charging and vmstat data updates happened only when new slab pages were
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache the byte charge for a specific
obj_cgroup to reduce that overhead. To further reduce it, this patch
caches the vmstat data in the memcg_stock_pcp structure as well, until
it accumulates a page size worth of updates or the other cached data
change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that, after initial boot, only about 17%
(946796 out of 5515184) of the calls to __mod_obj_stock_state() led to
an actual call to mod_objcg_state(). During a parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
 mm/slab.h       | 26 +++++++----------
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec = NULL;
 
 	local_irq_save(flags);
 	__refill_obj_stock(objcg, size);
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	if (pgdat)
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
-	rcu_read_unlock();
+	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
 	local_irq_restore(flags);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 677cdc52e641..ae971975d9fc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      gfp_t flags, size_t size,
@@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_objcgs(page)[off] = objcg;
-			mod_objcg_state(objcg, page_pgdat(page),
-					cache_vmstat_idx(s), obj_full_size(s));
+			mod_obj_stock_state(objcg, page_pgdat(page),
+					    cache_vmstat_idx(s),
+					    obj_full_size(s));
 		} else {
 			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
@@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
 					void **p, int objects)
 {
 }
+
+static void mod_obj_stock_state(struct obj_cgroup *objcg,
+				struct pglist_data *pgdat, int idx, int nr)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
-- 
2.18.1


* [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (2 preceding siblings ...)
  2021-04-09 23:18 ` [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
@ 2021-04-09 23:18 ` Waiman Long
  2021-04-12 18:53   ` Roman Gushchin
  2021-04-09 23:18 ` [PATCH 5/5] mm/memcg: Optimize user context object stock access Waiman Long
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

The object stock data stored in struct memcg_stock_pcp is independent
of the other page-based data stored there. Separate it out into its
own struct to highlight that independence.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1


* [PATCH 5/5] mm/memcg: Optimize user context object stock access
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (3 preceding siblings ...)
  2021-04-09 23:18 ` [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
@ 2021-04-09 23:18 ` Waiman Long
  2021-04-12 18:55   ` Roman Gushchin
  2021-04-10  1:51 ` [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Roman Gushchin
  2021-04-12 19:05 ` Roman Gushchin
  6 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2021-04-09 23:18 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin
  Cc: linux-kernel, cgroups, linux-mm, Shakeel Butt, Muchun Song,
	Alex Shi, Chris Down, Yafang Shao, Alexander Duyck, Wei Yang,
	Masayoshi Mizuma, Waiman Long

Most kmem_cache_alloc() calls are from user context. With
instrumentation enabled, the measured number of kmem_cache_alloc()
calls from non-task context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from the object stock is slow. To optimize for user context access,
there are now two object stocks, for task context and interrupt
context access respectively.

The task context object stock can be accessed after disabling
preemption, which is cheap in a non-preempt kernel. The interrupt
context object stock can only be accessed after disabling interrupts.
User context code can access the interrupt object stock, but not vice
versa.

The mod_objcg_state() function is also modified to make sure that the
memcg and lruvec stat updates are done with interrupts disabled.

The downside of this change is that more data is stored in the local
object stocks and not reflected in the charge counter and the vmstat
arrays. However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 71 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..00c9074e42e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,46 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)			\
+({						\
+	struct memcg_stock_pcp *stock;		\
+	struct obj_stock *obj_stock;		\
+						\
+	if (in_task()) {			\
+		preempt_disable();		\
+		(flags) = -1L;			\
+		obj_stock = &stock->task_obj;	\
+	} else {				\
+		local_irq_save(flags);		\
+		obj_stock = &stock->irq_obj;	\
+	}					\
+	obj_stock;				\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2363,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3221,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3193,7 +3231,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	stock = get_obj_stock(flags);
 
 	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3239,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
@@ -3254,8 +3292,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+	if (in_task() && stock->task_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+	if (stock->irq_obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3283,9 +3326,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, nr_bytes);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
@@ -3325,9 +3368,9 @@ void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__mod_obj_stock_state(objcg, pgdat, idx, nr);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3380,10 +3423,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	get_obj_stock(flags);
 	__refill_obj_stock(objcg, size);
 	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1


* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (4 preceding siblings ...)
  2021-04-09 23:18 ` [PATCH 5/5] mm/memcg: Optimize user context object stock access Waiman Long
@ 2021-04-10  1:51 ` Roman Gushchin
  2021-04-12 14:03   ` Waiman Long
  2021-04-12 19:05 ` Roman Gushchin
  6 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-10  1:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
> 
> For workloads that require a lot of kmemcache allocations and
> de-allocations, they may experience performance regression as illustrated
> in [1].
> 
> With a simple kernel module that performs repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
> init. The execution time to load the kernel module with and without
> memory accounting were:
> 
>   with accounting = 6.798s
>   w/o  accounting = 1.758s
> 
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s which is a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config include lockdep or not?
2) do you have a (rough) estimate of how much each change contributes
   to the overall reduction?

Thanks!

> 
> It was found that a major part of the memory accounting overhead
> is caused by the local_irq_save()/local_irq_restore() sequences in
> updating local stock charge bytes and vmstat array, at least in x86
> systems. There are two such sequences in kmem_cache_alloc() and two
> in kmem_cache_free(). This patchset tries to reduce the use of such
> sequences as much as possible. In fact, it eliminates them in the common
> case. Another part of this patchset to cache the vmstat data update in
> the local stock as well which also helps.
> 
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
> 
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
> 
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c            | 198 ++++++++++++++++++++++++++++++++-----
>  mm/percpu.c                |   9 +-
>  mm/slab.h                  |  32 +++---
>  4 files changed, 195 insertions(+), 58 deletions(-)
> 
> -- 
> 2.18.1
> 


* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-10  1:51 ` [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Roman Gushchin
@ 2021-04-12 14:03   ` Waiman Long
  2021-04-12 17:47     ` Roman Gushchin
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2021-04-12 14:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/9/21 9:51 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we
>> eliminate the need for having separate kmemcaches for each memory
>> cgroup and reduce overall kernel memory usage. However, we also add
>> additional memory accounting overhead to each call of kmem_cache_alloc()
>> and kmem_cache_free().
>>
>> For workloads that require a lot of kmemcache allocations and
>> de-allocations, they may experience performance regression as illustrated
>> in [1].
>>
>> With a simple kernel module that performs repeated loop of 100,000,000
>> kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
>> init. The execution time to load the kernel module with and without
>> memory accounting were:
>>
>>    with accounting = 6.798s
>>    w/o  accounting = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the
>> execution time became 4.254s. So the memory accounting overhead is now
>> 2.496s which is a 50% reduction.
> Hi Waiman!
>
> Thank you for working on it, it's indeed very useful!
> A couple of questions:
> 1) did your config included lockdep or not?
The test kernel is based on a production kernel config and so lockdep 
isn't enabled.
> 2) do you have a (rough) estimation how much each change contributes
>     to the overall reduction?

I should have provided a better breakdown of the effect of the
individual patches. I reran the benchmarking module with turbo-boosting
disabled to reduce run-to-run variation. The execution times were:

Before patch: time = 10.800s (with memory accounting), 2.848s (w/o 
accounting), overhead = 7.952s
After patch 2: time = 9.140s, overhead = 6.292s
After patch 3: time = 7.641s, overhead = 4.793s
After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that should not affect performance.

So the memory accounting overhead was reduced by about half (from
7.952s to 3.953s).

Cheers,
Longman



* Re: [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-09 23:18 ` [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
@ 2021-04-12 15:50   ` Shakeel Butt
  2021-04-12 18:10   ` Roman Gushchin
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-04-12 15:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 9, 2021 at 4:19 PM Waiman Long <longman@redhat.com> wrote:
>
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-12 14:03   ` Waiman Long
@ 2021-04-12 17:47     ` Roman Gushchin
  2021-04-12 19:20       ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 17:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
> On 4/9/21 9:51 PM, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> > > With the recent introduction of the new slab memory controller, we
> > > eliminate the need for having separate kmemcaches for each memory
> > > cgroup and reduce overall kernel memory usage. However, we also add
> > > additional memory accounting overhead to each call of kmem_cache_alloc()
> > > and kmem_cache_free().
> > > 
> > > For workloads that require a lot of kmemcache allocations and
> > > de-allocations, they may experience performance regression as illustrated
> > > in [1].
> > > 
> > > With a simple kernel module that performs repeated loop of 100,000,000
> > > kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
> > > init. The execution time to load the kernel module with and without
> > > memory accounting were:
> > > 
> > >    with accounting = 6.798s
> > >    w/o  accounting = 1.758s
> > > 
> > > That is an increase of 5.04s (287%). With this patchset applied, the
> > > execution time became 4.254s. So the memory accounting overhead is now
> > > 2.496s which is a 50% reduction.
> > Hi Waiman!
> > 
> > Thank you for working on it, it's indeed very useful!
> > A couple of questions:
> > 1) did your config included lockdep or not?
> The test kernel is based on a production kernel config and so lockdep isn't
> enabled.
> > 2) do you have a (rough) estimation how much each change contributes
> >     to the overall reduction?
> 
> I should have a better breakdown of the effect of individual patches. I
> rerun the benchmarking module with turbo-boosting disabled to reduce
> run-to-run variation. The execution times were:
> 
> Before patch: time = 10.800s (with memory accounting), 2.848s (w/o
> accounting), overhead = 7.952s
> After patch 2: time = 9.140s, overhead = 6.292s
> After patch 3: time = 7.641s, overhead = 4.793s
> After patch 5: time = 6.801s, overhead = 3.953s

Thank you! If there is going to be a v2, I'd include this information in the commit logs.

> 
> Patches 1 & 4 are preparatory patches that should affect performance.
> 
> So the memory accounting overhead was reduced by about half.

This is really great!

Thanks!


* Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
@ 2021-04-12 18:04   ` Roman Gushchin
  2021-04-12 19:24     ` Waiman Long
  2021-04-12 19:22   ` Shakeel Butt
  1 sibling, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 18:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:38PM -0400, Waiman Long wrote:
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.

This patch seems to be correct, but it's a bit hard to understand why
it's required without looking into the rest of the series. Can you, please,
add a couple of words about it? E.g. we need it to handle stats which do not
exist on the lruvec level...

Otherwise,
Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/memcontrol.h | 12 +++++++-----
>  mm/memcontrol.c            | 19 +++++++++++++------
>  mm/slab.h                  |  2 +-
>  3 files changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..95f12996e66c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return x;
>  }
>  
> -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> -			      int val);
> +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
> +			      enum node_stat_item idx, int val);
>  void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
>  
>  static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> @@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
>  	local_irq_restore(flags);
>  }
>  
> -static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> +static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
> +					  struct lruvec *lruvec,
>  					  enum node_stat_item idx, int val)
>  {
>  	unsigned long flags;
>  
>  	local_irq_save(flags);
> -	__mod_memcg_lruvec_state(lruvec, idx, val);
> +	__mod_memcg_lruvec_state(memcg, lruvec, idx, val);
>  	local_irq_restore(flags);
>  }
>  
> @@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return node_page_state(lruvec_pgdat(lruvec), idx);
>  }
>  
> -static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> +static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
> +					    struct lruvec *lruvec,
>  					    enum node_stat_item idx, int val)
>  {
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..d66e1e38f8ac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
>  	return mem_cgroup_nodeinfo(parent, nid);
>  }
>  
> -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> -			      int val)
> +/*
> + * Either one of memcg or lruvec can be NULL, but not both.
> + */
> +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
> +			      enum node_stat_item idx, int val)
>  {
>  	struct mem_cgroup_per_node *pn;
> -	struct mem_cgroup *memcg;
>  	long x, threshold = MEMCG_CHARGE_BATCH;
>  
> +	/* Update lruvec */
>  	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -	memcg = pn->memcg;
> +
> +	if (!memcg)
> +		memcg = pn->memcg;
>  
>  	/* Update memcg */
>  	__mod_memcg_state(memcg, idx, val);
>  
> -	/* Update lruvec */
> +	if (!lruvec)
> +		return;
> +
>  	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
>  
>  	if (vmstat_item_in_bytes(idx))
> @@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  
>  	/* Update memcg and lruvec */
>  	if (!mem_cgroup_disabled())
> -		__mod_memcg_lruvec_state(lruvec, idx, val);
> +		__mod_memcg_lruvec_state(NULL, lruvec, idx, val);
>  }
>  
>  void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> diff --git a/mm/slab.h b/mm/slab.h
> index 076582f58f68..bc6c7545e487 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
>  	rcu_read_lock();
>  	memcg = obj_cgroup_memcg(objcg);
>  	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	mod_memcg_lruvec_state(lruvec, idx, nr);
> +	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>  	rcu_read_unlock();
>  }
>  
> -- 
> 2.18.1
> 


* Re: [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  2021-04-09 23:18 ` [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
  2021-04-12 15:50   ` Shakeel Butt
@ 2021-04-12 18:10   ` Roman Gushchin
  1 sibling, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 18:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:39PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!


* Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-09 23:18 ` [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
@ 2021-04-12 18:22   ` Roman Gushchin
  2021-04-12 19:30     ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 18:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.

The idea makes total sense to me and also gives hope that the
byte-sized vmstats can be removed in the long term.

> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
>  mm/slab.h       | 26 +++++++----------
>  2 files changed, 79 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b19100c68aa0..539c3b632e47 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  	struct obj_cgroup *cached_objcg;
> +	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
> +	int vmstat_idx;
> +	int vmstat_bytes;
>  #endif

Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
slab_unreclaimable and percpu), I wonder if it's better to have
vmstat_bytes[3] and save a bit more by further reducing the number of
flushes? It must often be the case that a complex (reclaimable) kernel
object has non-reclaimable parts (e.g. kmallocs) or percpu counters.
If the difference is too small, maybe the current form is better.
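
Purely as an illustration of that idea (the array layout and index
comments below are invented for this sketch, not a concrete proposal),
the per-cpu cache would keep one byte delta per byte-sized vmstat item
instead of a single (vmstat_idx, vmstat_bytes) pair:

	struct pglist_data *cached_pgdat;
	/*
	 * One cached byte delta per byte-sized vmstat item:
	 * [0] slab reclaimable, [1] slab unreclaimable, [2] percpu
	 */
	int vmstat_bytes[3];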

>  
>  	struct work_struct work;
> @@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	css_put(&memcg->css);
>  }
>  
> +static inline void mod_objcg_state(struct obj_cgroup *objcg,
> +				   struct pglist_data *pgdat,
> +				   enum node_stat_item idx, int nr)
> +{
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec = NULL;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	if (pgdat)
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +	rcu_read_unlock();
> +}
> +
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
>  	struct memcg_stock_pcp *stock;
> @@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  		stock->nr_bytes = 0;
>  	}
>  
> +	if (stock->vmstat_bytes) {
> +		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> +				stock->vmstat_bytes);
> +		stock->vmstat_bytes = 0;
> +		stock->vmstat_idx = 0;
> +		stock->cached_pgdat = NULL;
> +	}
> +
>  	obj_cgroup_put(old);
>  	stock->cached_objcg = NULL;
>  }
> @@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  	local_irq_restore(flags);
>  }
>  
> +static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> +				  struct pglist_data *pgdat, int idx, int nr)
> +{
> +	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> +
> +	if (stock->cached_objcg != objcg) {
> +		/* Output the current data as is */
> +	} else if (!stock->vmstat_bytes) {
> +		/* Save the current data */
> +		stock->vmstat_bytes = nr;
> +		stock->vmstat_idx = idx;
> +		stock->cached_pgdat = pgdat;
> +		nr = 0;
> +	} else if ((stock->cached_pgdat != pgdat) ||
> +		   (stock->vmstat_idx != idx)) {
> +		/* Output the cached data & save the current data */
> +		swap(nr, stock->vmstat_bytes);
> +		swap(idx, stock->vmstat_idx);
> +		swap(pgdat, stock->cached_pgdat);
> +	} else {
> +		stock->vmstat_bytes += nr;
> +		if (abs(nr) > PAGE_SIZE) {
> +			nr = stock->vmstat_bytes;
> +			stock->vmstat_bytes = 0;
> +		} else {
> +			nr = 0;
> +		}
> +	}
> +	if (nr)
> +		mod_objcg_state(objcg, pgdat, idx, nr);
> +}
> +
> +void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +			 int idx, int nr)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__mod_obj_stock_state(objcg, pgdat, idx, nr);
> +	local_irq_restore(flags);
> +}
> +
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>  {
>  	struct mem_cgroup *memcg;
> @@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
>  				   struct pglist_data *pgdat, int idx)
>  {
>  	unsigned long flags;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec = NULL;
>  
>  	local_irq_save(flags);
>  	__refill_obj_stock(objcg, size);
> -
> -	rcu_read_lock();
> -	memcg = obj_cgroup_memcg(objcg);
> -	if (pgdat)
> -		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
> -	rcu_read_unlock();
> +	__mod_obj_stock_state(objcg, pgdat, idx, -(int)size);
>  	local_irq_restore(flags);
>  }
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 677cdc52e641..ae971975d9fc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
>  #ifdef CONFIG_MEMCG_KMEM
>  int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
>  				 gfp_t gfp, bool new_page);
> +void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +			 int idx, int nr);
>  
>  static inline void memcg_free_page_obj_cgroups(struct page *page)
>  {
> @@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
>  	return true;
>  }
>  
> -static inline void mod_objcg_state(struct obj_cgroup *objcg,
> -				   struct pglist_data *pgdat,
> -				   enum node_stat_item idx, int nr)
> -{
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = obj_cgroup_memcg(objcg);
> -	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> -	rcu_read_unlock();
> -}
> -
>  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>  					      struct obj_cgroup *objcg,
>  					      gfp_t flags, size_t size,
> @@ -324,8 +312,9 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>  			off = obj_to_index(s, page, p[i]);
>  			obj_cgroup_get(objcg);
>  			page_objcgs(page)[off] = objcg;
> -			mod_objcg_state(objcg, page_pgdat(page),
> -					cache_vmstat_idx(s), obj_full_size(s));
> +			mod_obj_stock_state(objcg, page_pgdat(page),
> +					    cache_vmstat_idx(s),
> +					    obj_full_size(s));
>  		} else {
>  			obj_cgroup_uncharge(objcg, obj_full_size(s));
>  		}
> @@ -408,6 +397,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s,
>  					void **p, int objects)
>  {
>  }
> +
> +static inline void mod_obj_stock_state(struct obj_cgroup *objcg,
> +				struct pglist_data *pgdat, int idx, int nr)
> +{
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  static inline struct kmem_cache *virt_to_cache(const void *obj)
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct
  2021-04-09 23:18 ` [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
@ 2021-04-12 18:53   ` Roman Gushchin
  0 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 18:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:41PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page-based data stored there. Separate them out into
> their own struct to highlight that independence.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Roman Gushchin <guro@fb.com>
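
As a rough sketch of the split being described (the struct and field
names here are illustrative assumptions based on the fields visible
earlier in the series, not necessarily what the patch itself uses):

struct obj_stock {
#ifdef CONFIG_MEMCG_KMEM
	struct obj_cgroup *cached_objcg;	/* object charge cache */
	struct pglist_data *cached_pgdat;	/* node of cached vmstat data */
	unsigned int nr_bytes;			/* cached charge in bytes */
	int vmstat_idx;				/* cached vmstat index */
	int vmstat_bytes;			/* cached vmstat delta */
#endif
};

struct memcg_stock_pcp {
	struct mem_cgroup *cached;	/* page-based charge cache */
	unsigned int nr_pages;
	struct obj_stock obj;		/* object-based charge cache */

	struct work_struct work;
	unsigned long flags;
};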


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] mm/memcg: Optimize user context object stock access
  2021-04-09 23:18 ` [PATCH 5/5] mm/memcg: Optimize user context object stock access Waiman Long
@ 2021-04-12 18:55   ` Roman Gushchin
  2021-04-12 19:58     ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 18:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:42PM -0400, Waiman Long wrote:
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured number of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
> 
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
> 
> The task context object stock can be accessed after disabling preemption
> which is cheap in a non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupts. User context code can
> access interrupt object stock, but not vice versa.
> 
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupts disabled.
> 
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.

I agree, the extra memory space is not a significant concern.
I'd be more worried about the code complexity, but the result looks
nice to me!

Acked-by: Roman Gushchin <guro@fb.com>

Btw, it seems that the mm tree has diverged a bit, so I had to apply this
series on top of Linus's tree to review it. Please rebase.

Thanks!
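
To make the dual-stock scheme above concrete, the context-dependent
access could be structured roughly as below. This is only a sketch:
helper and field names such as get_obj_stock(), task_obj and irq_obj
are assumptions, not necessarily what the patch itself uses.

static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
{
	struct memcg_stock_pcp *stock;

	if (likely(in_task())) {
		*pflags = 0;			/* unused in this path */
		preempt_disable();		/* cheap on non-preempt kernels */
		stock = this_cpu_ptr(&memcg_stock);
		return &stock->task_obj;	/* task-context stock */
	}

	local_irq_save(*pflags);		/* irq context still needs this */
	stock = this_cpu_ptr(&memcg_stock);
	return &stock->irq_obj;			/* interrupt-context stock */
}

static inline void put_obj_stock(unsigned long flags)
{
	if (likely(in_task()))
		preempt_enable();
	else
		local_irq_restore(flags);
}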

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
                   ` (5 preceding siblings ...)
  2021-04-10  1:51 ` [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Roman Gushchin
@ 2021-04-12 19:05 ` Roman Gushchin
  2021-04-12 19:51   ` Waiman Long
  6 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2021-04-12 19:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
> 
> Workloads that require a lot of kmemcache allocations and
> de-allocations may experience performance regressions as illustrated
> in [1].
> 
> With a simple kernel module that performs a repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
> module init, the execution times to load the kernel module with and
> without memory accounting were:
> 
>   with accounting = 6.798s
>   w/o  accounting = 1.758s
> 
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s which is a 50% reduction.

> Btw, there were two recent independent reports about benchmark
> regressions caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression:
https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression:
https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the result
to the series? It would be very helpful.

Thank you!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-12 17:47     ` Roman Gushchin
@ 2021-04-12 19:20       ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-12 19:20 UTC (permalink / raw)
  To: Roman Gushchin, Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/12/21 1:47 PM, Roman Gushchin wrote:
> On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
>> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>>> With the recent introduction of the new slab memory controller, we
>>>> eliminate the need for having separate kmemcaches for each memory
>>>> cgroup and reduce overall kernel memory usage. However, we also add
>>>> additional memory accounting overhead to each call of kmem_cache_alloc()
>>>> and kmem_cache_free().
>>>>
>>>> Workloads that require a lot of kmemcache allocations and
>>>> de-allocations may experience performance regressions as illustrated
>>>> in [1].
>>>>
>>>> With a simple kernel module that performs a repeated loop of 100,000,000
>>>> kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
>>>> module init, the execution times to load the kernel module with and
>>>> without memory accounting were:
>>>>
>>>>     with accounting = 6.798s
>>>>     w/o  accounting = 1.758s
>>>>
>>>> That is an increase of 5.04s (287%). With this patchset applied, the
>>>> execution time became 4.254s. So the memory accounting overhead is now
>>>> 2.496s which is a 50% reduction.
>>> Hi Waiman!
>>>
>>> Thank you for working on it, it's indeed very useful!
>>> A couple of questions:
>>> 1) did your config included lockdep or not?
>> The test kernel is based on a production kernel config and so lockdep isn't
>> enabled.
>>> 2) do you have a (rough) estimate of how much each change contributes
>>>      to the overall reduction?
>> I should have a better breakdown of the effect of individual patches. I
>> reran the benchmarking module with turbo-boosting disabled to reduce
>> run-to-run variation. The execution times were:
>>
>> Before patch: time = 10.800s (with memory accounting), 2.848s (w/o
>> accounting), overhead = 7.952s
>> After patch 2: time = 9.140s, overhead = 6.292s
>> After patch 3: time = 7.641s, overhead = 4.793s
>> After patch 5: time = 6.801s, overhead = 3.953s
> Thank you! If there is going to be a v2, I'd include this information in the commit logs.

Yes, I am planning to send out a v2 with this information in the
cover letter. I am just waiting a bit to see if there is more feedback.

-Longman

>
>> Patches 1 & 4 are preparatory patches that should not affect performance.
>>
>> So the memory accounting overhead was reduced by about half.

BTW, the benchmark that I used is kind of a best-case behavior, as all
updates are to the percpu stocks. Real workloads will likely have a
certain amount of updates to the memcg charges and vmstats, so the
performance benefit will be less.

Cheers,
Longman
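
For anyone who wants to reproduce a similar best-case micro-benchmark,
a minimal module along the lines described in the cover letter could
look like the sketch below. This is a reconstruction, not the module
actually used for the numbers above; SLAB_ACCOUNT is assumed here as
the way to enable accounting on the test cache.

// SPDX-License-Identifier: GPL-2.0
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define NR_LOOPS	100000000UL

static int __init kmemcache_bench_init(void)
{
	struct kmem_cache *cachep;
	ktime_t start;
	unsigned long i;
	void *obj;

	/* 64-byte objects; drop SLAB_ACCOUNT for the "w/o accounting" run */
	cachep = kmem_cache_create("bench-64", 64, 0, SLAB_ACCOUNT, NULL);
	if (!cachep)
		return -ENOMEM;

	start = ktime_get();
	for (i = 0; i < NR_LOOPS; i++) {
		obj = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cachep, obj);
	}
	pr_info("bench-64: %lu alloc/free pairs in %lld ms\n",
		i, ktime_ms_delta(ktime_get(), start));

	kmem_cache_destroy(cachep);
	return 0;
}

static void __exit kmemcache_bench_exit(void)
{
}

module_init(kmemcache_bench_init);
module_exit(kmemcache_bench_exit);
MODULE_LICENSE("GPL");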



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
  2021-04-12 18:04   ` Roman Gushchin
@ 2021-04-12 19:22   ` Shakeel Butt
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-04-12 19:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Roman Gushchin, LKML, Cgroups,
	Linux MM, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On Fri, Apr 9, 2021 at 4:19 PM Waiman Long <longman@redhat.com> wrote:
>
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Similar to Roman's suggestion: instead of what this patch is doing, the
'why' would be better in the changelog.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  2021-04-12 18:04   ` Roman Gushchin
@ 2021-04-12 19:24     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-12 19:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/12/21 2:04 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:38PM -0400, Waiman Long wrote:
>> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
>> available. So both of them are now passed to mod_memcg_lruvec_state()
>> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
>> updated to allow either of the two parameters to be set to null. This
>> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
>> is null.
> This patch seems to be correct, but it's a bit hard to understand why
> it's required without looking into the rest of the series. Can you, please,
> add a couple of words about it? E.g. we need it to handle stats which do not
> exist on the lruvec level...
>
> Otherwise,
> Acked-by: Roman Gushchin <guro@fb.com>

Good point. Will update the commit log to indicate the change is needed 
for the subsequent patch.

Cheers,
Longman
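
For context, the calling pattern that the NULL-tolerant form enables is
roughly the one already visible in the uncharge path earlier in the
series, where pgdat (and hence the lruvec) may be absent. A sketch, with
uncharge_stat_sketch() being a hypothetical wrapper name:

static void uncharge_stat_sketch(struct obj_cgroup *objcg,
				 struct pglist_data *pgdat, int idx, int nr)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec = NULL;

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	if (pgdat)
		lruvec = mem_cgroup_lruvec(memcg, pgdat);
	/* With a NULL lruvec this degrades to a plain memcg stat update. */
	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
	rcu_read_unlock();
}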



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  2021-04-12 18:22   ` Roman Gushchin
@ 2021-04-12 19:30     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-12 19:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/12/21 2:22 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per-object byte charging,
>> charging and vmstat data updates happened only when new slab pages were
>> allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache the byte charge for a specific
>> obj_cgroup to reduce that overhead. To reduce it further, this patch
>> caches the vmstat data in the memcg_stock_pcp structure as well, until
>> it accumulates a page size worth of updates or another cached item
>> changes.
> The idea makes total sense to me and also gives hope that we can remove
> byte-sized vmstats in the long term.
>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, it was found that about 17% (946796 out of 5515184) of the
>> calls to __mod_obj_stock_state() led to an actual call to
>> mod_objcg_state() after initial boot. When doing a parallel kernel build,
>> the figure was about 16% (21894614 out of 139780628). So caching the
>> vmstat data reduces the number of calls to mod_objcg_state() by more
>> than 80%.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
>>   mm/slab.h       | 26 +++++++----------
>>   2 files changed, 79 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b19100c68aa0..539c3b632e47 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
>>   
>>   #ifdef CONFIG_MEMCG_KMEM
>>   	struct obj_cgroup *cached_objcg;
>> +	struct pglist_data *cached_pgdat;
>>   	unsigned int nr_bytes;
>> +	int vmstat_idx;
>> +	int vmstat_bytes;
>>   #endif
> Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
> slab_unreclaimable and percpu), I wonder if it's better to have
> vmstat_bytes[3] and save a bit more on the reduced number of flushes?
> It must be a common case that a complex (reclaimable) kernel object has
> non-reclaimable parts (e.g. kmallocs) or percpu counters.
> If the difference is too small, maybe the current form is better.

I have thought about that too. However, that will make the code more 
complex. So I decided to cache just one for now. We can certainly play 
around with caching more in a later patch.

Cheers,
Longman
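
For reference, the per-index variant being discussed could be shaped
roughly as below. This is purely illustrative: the enum, obj_stock_stat[]
and obj_stock_flush_vmstat() are hypothetical names, and no such patch
exists in this series.

enum { OBJ_SLAB_REC, OBJ_SLAB_UNREC, OBJ_PERCPU, NR_OBJ_STOCK_IDX };

static const int obj_stock_stat[NR_OBJ_STOCK_IDX] = {
	[OBJ_SLAB_REC]   = NR_SLAB_RECLAIMABLE_B,
	[OBJ_SLAB_UNREC] = NR_SLAB_UNRECLAIMABLE_B,
	[OBJ_PERCPU]     = MEMCG_PERCPU_B,
};

struct obj_stock_vmstat {
	struct pglist_data *cached_pgdat;
	int vmstat_bytes[NR_OBJ_STOCK_IDX];	/* one cached delta per index */
};

/* Flush every non-zero slot; called on objcg/pgdat change or drain. */
static void obj_stock_flush_vmstat(struct obj_cgroup *objcg,
				   struct obj_stock_vmstat *vs)
{
	int i;

	for (i = 0; i < NR_OBJ_STOCK_IDX; i++) {
		if (!vs->vmstat_bytes[i])
			continue;
		mod_objcg_state(objcg, vs->cached_pgdat, obj_stock_stat[i],
				vs->vmstat_bytes[i]);
		vs->vmstat_bytes[i] = 0;
	}
}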



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
  2021-04-12 19:05 ` Roman Gushchin
@ 2021-04-12 19:51   ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-12 19:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/12/21 3:05 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we
>> eliminate the need for having separate kmemcaches for each memory
>> cgroup and reduce overall kernel memory usage. However, we also add
>> additional memory accounting overhead to each call of kmem_cache_alloc()
>> and kmem_cache_free().
>>
>> Workloads that require a lot of kmemcache allocations and
>> de-allocations may experience performance regressions as illustrated
>> in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000
>> kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
>> module init, the execution times to load the kernel module with and
>> without memory accounting were:
>>
>>    with accounting = 6.798s
>>    w/o  accounting = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the
>> execution time became 4.254s. So the memory accounting overhead is now
>> 2.496s which is a 50% reduction.
> Btw, there were two recent independent reports about benchmark
> regressions caused by the introduction of the per-object accounting:
> 1) Xing reported a hackbench regression:
> https://lkml.org/lkml/2021/1/13/1277
> 2) Masayoshi reported a pgbench regression:
> https://www.spinics.net/lists/linux-mm/msg252540.html
>
> I wonder if you can run them (or at least one) and attach the result
> to the series? It would be very helpful.

Actually, it was a bug report filed by Masayoshi-san that triggered me
to work on reducing the memory accounting overhead. He is also on the cc
line and so is aware of that. I will cc Xing in my v2 patch.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] mm/memcg: Optimize user context object stock access
  2021-04-12 18:55   ` Roman Gushchin
@ 2021-04-12 19:58     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2021-04-12 19:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Tejun Heo, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, linux-kernel, cgroups, linux-mm,
	Shakeel Butt, Muchun Song, Alex Shi, Chris Down, Yafang Shao,
	Alexander Duyck, Wei Yang, Masayoshi Mizuma

On 4/12/21 2:55 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:42PM -0400, Waiman Long wrote:
>> Most kmem_cache_alloc() calls are from user context. With instrumentation
>> enabled, the measured number of kmem_cache_alloc() calls from non-task
>> context was about 0.01% of the total.
>>
>> The irq disable/enable sequence used in this case to access content
>> from object stock is slow.  To optimize for user context access, there
>> are now two object stocks for task context and interrupt context access
>> respectively.
>>
>> The task context object stock can be accessed after disabling preemption
>> which is cheap in a non-preempt kernel. The interrupt context object stock
>> can only be accessed after disabling interrupts. User context code can
>> access interrupt object stock, but not vice versa.
>>
>> The mod_objcg_state() function is also modified to make sure that memcg
>> and lruvec stat updates are done with interrupts disabled.
>>
>> The downside of this change is that there are more data stored in local
>> object stocks and not reflected in the charge counter and the vmstat
>> arrays.  However, this is a small price to pay for better performance.
> I agree, the extra memory space is not a significant concern.
> I'd be more worried about the code complexity, but the result looks
> nice to me!
>
> Acked-by: Roman Gushchin <guro@fb.com>
>
> Btw, it seems that the mm tree has diverged a bit, so I had to apply this
> series on top of Linus's tree to review it. Please rebase.

This patchset is based on the code in Linus' tree. I had applied the
patchset to linux-next to see if there were any conflicts. Two of the
patches had some minor fuzz around the edges, but there were no actual
merge conflicts for now.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-04-12 19:58 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09 23:18 [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Waiman Long
2021-04-09 23:18 ` [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() Waiman Long
2021-04-12 18:04   ` Roman Gushchin
2021-04-12 19:24     ` Waiman Long
2021-04-12 19:22   ` Shakeel Butt
2021-04-09 23:18 ` [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state() Waiman Long
2021-04-12 15:50   ` Shakeel Butt
2021-04-12 18:10   ` Roman Gushchin
2021-04-09 23:18 ` [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp Waiman Long
2021-04-12 18:22   ` Roman Gushchin
2021-04-12 19:30     ` Waiman Long
2021-04-09 23:18 ` [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct Waiman Long
2021-04-12 18:53   ` Roman Gushchin
2021-04-09 23:18 ` [PATCH 5/5] mm/memcg: Optimize user context object stock access Waiman Long
2021-04-12 18:55   ` Roman Gushchin
2021-04-12 19:58     ` Waiman Long
2021-04-10  1:51 ` [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead Roman Gushchin
2021-04-12 14:03   ` Waiman Long
2021-04-12 17:47     ` Roman Gushchin
2021-04-12 19:20       ` Waiman Long
2021-04-12 19:05 ` Roman Gushchin
2021-04-12 19:51   ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).