linux-mm.kvack.org archive mirror
* [PATCH v3 00/19] The new cgroup slab memory controller
@ 2020-04-22 20:46 Roman Gushchin
  2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
                   ` (18 more replies)
  0 siblings, 19 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

This is the third version of the slab cgroup controller rework.

The patchset moves slab accounting from the page level to the object
level, which allows slab pages to be shared between memory cgroups.
This leads to a significant improvement in slab utilization (up to 45%)
and a corresponding drop in the total kernel memory footprint.
The reduced number of unmovable slab pages should also have a positive
effect on memory fragmentation.
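
As a rough illustration of why sharing slab pages can reduce the
footprint, here is a minimal userspace sketch. It is not part of the
patchset: all toy_* names and the idea of a fixed number of objects per
page are assumptions made for the example, and it ignores per-object
metadata and fragmentation. It compares the number of pages needed when
every cgroup has its own cache with the number needed when all cgroups
share one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: integer division rounding up. */
static size_t toy_div_round_up(size_t n, size_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/*
 * Page-level accounting: each cgroup owns its pages, so every cgroup
 * rounds its object count up to a whole number of pages.
 */
static size_t toy_pages_per_cgroup_caches(const size_t *objs, size_t nr_cgroups,
					  size_t objs_per_page)
{
	size_t pages = 0;

	for (size_t i = 0; i < nr_cgroups; i++)
		pages += toy_div_round_up(objs[i], objs_per_page);
	return pages;
}

/*
 * Object-level accounting: objects of different cgroups may live on
 * the same page, so only the total is rounded up.
 */
static size_t toy_pages_shared_caches(const size_t *objs, size_t nr_cgroups,
				      size_t objs_per_page)
{
	size_t total = 0;

	for (size_t i = 0; i < nr_cgroups; i++)
		total += objs[i];
	return toy_div_round_up(total, objs_per_page);
}
```

With four cgroups holding 3, 5, 1 and 7 objects and 16 objects per
page, the first function reports 4 pages and the second 1. Real
savings depend on the workload and are not predicted by this toy model.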

The patchset makes the slab accounting code simpler: there is no longer
any need for the complicated dynamic creation and destruction of
per-cgroup slab caches; all memory cgroups use a global set of shared
slab caches. The lifetime of slab caches is no longer tied to the
lifetime of memory cgroups.

The more precise accounting does require more CPU, but in practice
the difference seems to be negligible. We've been using the new slab
controller in Facebook production for several months with different
workloads and haven't seen any noticeable regressions. What we have seen
are memory savings on the order of 1 GB per host (this varied heavily
depending on the actual workload, size of RAM, number of CPUs, memory
pressure, etc).

The third version of the patchset adds yet another step towards
simplifying the code: sharing slab caches between accounted and
non-accounted allocations. It comes with significant upsides (most
noticeably, the complete elimination of dynamic slab cache creation)
but not without some regression risk, so this change sits on top of
the patchset and is not squashed into the earlier patches. In the
unlikely event of a noticeable performance regression, it can be
reverted separately.

v3:
  1) added a patch that switches to a global single set of kmem_caches
  2) kmem API clean up dropped, because it has already been merged
  3) byte-sized slab vmstat API over page-sized global counters and
     byte-sized memcg/lruvec counters
  4) obj_cgroup refcounting simplifications and other minor fixes
  5) other minor changes

v2:
  1) implemented re-layering and renaming suggested by Johannes,
     added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (18):
  mm: memcg: factor out memcg- and lruvec-level changes out of
    __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all accounted
    allocations
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: deprecate memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  mm: memcg/slab: use a single set of kmem_caches for all allocations
  kselftests: cgroup: add kernel memory accounting tests
  tools/cgroup: add memcg_slabinfo.py tool

 drivers/base/node.c                        |   6 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 |  80 ++-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  19 +-
 include/linux/slab.h                       |   5 -
 include/linux/slab_def.h                   |   8 +-
 include/linux/slub_def.h                   |  20 +-
 include/linux/vmstat.h                     |  16 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/memcontrol.c                            | 569 ++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |   8 +-
 mm/slab.c                                  |  39 +-
 mm/slab.h                                  | 365 +++++-------
 mm/slab_common.c                           | 643 +--------------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  | 183 +-----
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  33 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/memcg_slabinfo.py             | 226 ++++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
 25 files changed, 1322 insertions(+), 1317 deletions(-)
 create mode 100755 tools/cgroup/memcg_slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

-- 
2.25.3




* [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-07 20:33   ` Johannes Weiner
  2020-05-20 10:49   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

To convert memcg and lruvec slab counters to bytes, there must be
a way to change these counters without touching node counters.
Factor __mod_memcg_lruvec_state() out of __mod_lruvec_state().
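
The shape of the refactoring can be sketched in userspace as follows.
This is only an illustration: the toy_* names and the flat struct are
assumptions made for the example and do not exist in the kernel. One
function updates only the memcg and lruvec levels, and the original
entry point keeps updating all three levels by calling it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three counter levels of one stat item. */
struct toy_counters {
	long node;		/* per-node counter */
	long memcg;		/* per-cgroup counter */
	long lruvec;		/* per-lruvec counter */
	bool memcg_disabled;	/* models mem_cgroup_disabled() */
};

/* The factored-out half: touches memcg and lruvec, never the node. */
static void toy_mod_memcg_lruvec_state(struct toy_counters *c, long val)
{
	c->memcg += val;
	c->lruvec += val;
}

/* The original entry point: still updates all three levels. */
static void toy_mod_lruvec_state(struct toy_counters *c, long val)
{
	/* Update node */
	c->node += val;

	/* Update memcg and lruvec */
	if (!c->memcg_disabled)
		toy_mod_memcg_lruvec_state(c, val);
}
```

A caller that wants byte-precise memcg and lruvec counters without
disturbing the page-based node counter would use the first function
directly.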

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 17 ++++++++++++++++
 mm/memcontrol.c            | 41 +++++++++++++++++++++-----------------
 2 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d630af1a4e17..c2eb73d89f5d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -692,11 +692,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val);
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
 void mod_memcg_obj_state(void *p, int idx, int val);
 
+static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_memcg_lruvec_state(lruvec, idx, val);
+	local_irq_restore(flags);
+}
+
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
 {
@@ -1092,6 +1104,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
+static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+					    enum node_stat_item idx, int val)
+{
+}
+
 static inline void __mod_lruvec_state(struct lruvec *lruvec,
 				      enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 44579831221a..f6ff20095105 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -713,30 +713,14 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-/**
- * __mod_lruvec_state - update lruvec memory statistics
- * @lruvec: the lruvec
- * @idx: the stat item
- * @val: delta to add to the counter, can be negative
- *
- * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
- */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			int val)
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val)
 {
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
 	long x;
 
-	/* Update node */
-	__mod_node_page_state(pgdat, idx, val);
-
-	if (mem_cgroup_disabled())
-		return;
-
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
@@ -757,6 +741,27 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			int val)
+{
+	/* Update node */
+	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+
+	/* Update memcg and lruvec */
+	if (!mem_cgroup_disabled())
+		__mod_memcg_lruvec_state(lruvec, idx, val);
+}
+
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
 	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
-- 
2.25.3




* [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
  2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-07 20:34   ` Johannes Weiner
  2020-05-20 11:31   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

To implement per-object slab memory accounting, we need to
convert slab vmstat counters to bytes. Of the four levels of
counters (global, per-node, per-memcg and per-lruvec), only the
last two will require byte-sized counters. This is because global
and per-node counters will be counting the number of slab pages,
while per-memcg and per-lruvec counters will be counting the
amount of memory taken by charged slab objects.

Converting all vmstat counters to bytes, or even all slab
counters to bytes, would introduce additional overhead.
So instead let's store global and per-node counters
in pages, and memcg and lruvec counters in bytes.

To keep the API clean, all access helpers (on both the read
and write sides) deal in bytes.

To avoid back-and-forth conversions, a new flavor of helpers
is introduced, which always returns values in pages:
node_page_state_pages() and global_node_page_state_pages().

The new helpers simply read the raw values. The old helpers are
simple wrappers that perform a conversion if the vmstat item is
in bytes. Because at the moment no one actually needs bytes,
they contain WARN_ON_ONCE() macros to warn about inappropriate
use cases.
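
The layering described above can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the kernel code:
the toy_* names, the two-item enum and the 4 KiB page size are made up
for the example, and the WARN_ON_ONCE() checks in the wrappers are left
out. Storage is always in pages; the "_pages" reader returns the raw
value, while the plain reader and the writer speak the item's own unit.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE	(1L << TOY_PAGE_SHIFT)

enum toy_item { TOY_NR_FILE, TOY_NR_SLAB_B, TOY_NR_ITEMS };

/* Internal storage: always in pages, whatever the item's API unit is. */
static long toy_node_stat[TOY_NR_ITEMS];

/* Models vmstat_item_in_bytes(): which items use bytes at the API level. */
static bool toy_item_in_bytes(enum toy_item item)
{
	return item == TOY_NR_SLAB_B;
}

/* New-style reader: returns the raw stored value, i.e. pages. */
static unsigned long toy_node_state_pages(enum toy_item item)
{
	return (unsigned long)toy_node_stat[item];
}

/* Old-style reader: a wrapper that converts byte-sized items to bytes. */
static unsigned long toy_node_state(enum toy_item item)
{
	unsigned long x = toy_node_state_pages(item);

	if (toy_item_in_bytes(item))
		return x << TOY_PAGE_SHIFT;
	return x;
}

/* Writer: takes bytes for byte-sized items and stores whole pages. */
static void toy_mod_node_state(enum toy_item item, long delta)
{
	if (toy_item_in_bytes(item)) {
		/* Byte deltas are expected to be multiples of the page size. */
		assert(delta % TOY_PAGE_SIZE == 0);
		delta /= TOY_PAGE_SIZE;
	}
	toy_node_stat[item] += delta;
}
```

Callers that want pages (for example, code printing per-node
statistics) can use the "_pages" reader and skip the conversion in both
directions.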

Thanks to Johannes Weiner for the idea of having the byte-sized API
on top of the page-sized internal storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/base/node.c    |  2 +-
 include/linux/mmzone.h |  5 +++++
 include/linux/vmstat.h | 16 +++++++++++++++-
 mm/memcontrol.c        | 14 ++++++++++----
 mm/vmstat.c            | 33 +++++++++++++++++++++++++++++----
 5 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..9d6afb7d2ccd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -507,7 +507,7 @@ static ssize_t node_read_vmstat(struct device *dev,
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
-			     node_page_state(pgdat, i));
+			     node_page_state_pages(pgdat, i));
 
 	return n;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fbda9ddd1f..22fe65edf425 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -204,6 +204,11 @@ enum node_stat_item {
 	NR_VM_NODE_STAT_ITEMS
 };
 
+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+	return false;
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..117763827cd0 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -190,7 +190,8 @@ static inline unsigned long global_zone_page_state(enum zone_stat_item item)
 	return x;
 }
 
-static inline unsigned long global_node_page_state(enum node_stat_item item)
+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
 {
 	long x = atomic_long_read(&vm_node_stat[item]);
 #ifdef CONFIG_SMP
@@ -200,6 +201,16 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return x;
 }
 
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+	unsigned long x = global_node_page_state_pages(item);
+
+	if (WARN_ON_ONCE(vmstat_item_in_bytes(item)))
+		return x << PAGE_SHIFT;
+
+	return x;
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -240,9 +251,12 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+					   enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
+#define node_page_state_pages(node, item) global_node_page_state_pages(item)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6ff20095105..5f700fa8b78c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -681,13 +681,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	if (mem_cgroup_disabled())
 		return;
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup *mi;
 
 		/*
@@ -719,7 +722,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
@@ -730,8 +733,11 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup_per_node *pi;
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6fd1407f4632..7ac13f6d189a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -341,6 +341,11 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 	long x;
 	long t;
 
+	if (vmstat_item_in_bytes(item)) {
+		WARN_ON(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -398,6 +403,8 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	s8 v, t;
 
+	WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -442,6 +449,8 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	s8 v, t;
 
+	WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -541,6 +550,11 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
+	if (vmstat_item_in_bytes(item)) {
+		WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
 	do {
 		z = 0;  /* overflow to node counters */
 
@@ -989,8 +1003,8 @@ unsigned long sum_zone_numa_state(int node,
 /*
  * Determine the per node value of a stat item.
  */
-unsigned long node_page_state(struct pglist_data *pgdat,
-				enum node_stat_item item)
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+				    enum node_stat_item item)
 {
 	long x = atomic_long_read(&pgdat->vm_stat[item]);
 #ifdef CONFIG_SMP
@@ -999,6 +1013,17 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 #endif
 	return x;
 }
+
+unsigned long node_page_state(struct pglist_data *pgdat,
+				enum node_stat_item item)
+{
+	unsigned long x = node_page_state_pages(pgdat, item);
+
+	if (WARN_ON_ONCE(vmstat_item_in_bytes(item)))
+		return x << PAGE_SHIFT;
+
+	return x;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1571,7 +1596,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
-				   node_page_state(pgdat, i));
+				   node_page_state_pages(pgdat, i));
 		}
 	}
 	seq_printf(m,
@@ -1692,7 +1717,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 #endif
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		v[i] = global_node_page_state(i);
+		v[i] = global_node_page_state_pages(i);
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
-- 
2.25.3




* [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
  2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
  2020-04-22 20:46 ` [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-07 20:41   ` Johannes Weiner
  2020-05-20 12:25   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally, global and per-node counters are stored in pages,
whereas memcg and lruvec counters are stored in bytes.
This scheme may look weird, but only for now. As soon as slab
pages are shared between multiple cgroups, global and
node counters will reflect the total number of slab pages,
while memcg and lruvec counters will be used for per-memcg
slab memory tracking, which takes individual kernel objects
into account. Keeping global and node counters in pages helps
to avoid additional overhead.

The size of slab memory shouldn't exceed 4 GB on 32-bit machines,
so it will fit into the atomic_long_t we use for vmstats.
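
For callers, the conversion mostly means passing a byte delta where a
page delta was passed before. A minimal userspace sketch of that
change, assuming 4 KiB pages and using hypothetical toy_* names that
are not part of the patch:

```c
#define TOY_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE	(1L << TOY_PAGE_SHIFT)

/* Old convention: the delta for an order-N slab page, in pages. */
static long toy_slab_delta_pages(unsigned int order)
{
	return 1L << order;
}

/* New convention: the delta for the same slab page, in bytes. */
static long toy_slab_delta_bytes(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}
```

Both functions describe the same amount of memory; the byte value is
always a multiple of the page size, which is what the global and
per-node counters expect when they convert it back to pages.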

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/base/node.c     |  4 ++--
 fs/proc/meminfo.c       |  4 ++--
 include/linux/mmzone.h  | 16 +++++++++++++---
 kernel/power/snapshot.c |  2 +-
 mm/memcontrol.c         | 11 ++++-------
 mm/oom_kill.c           |  2 +-
 mm/page_alloc.c         |  8 ++++----
 mm/slab.h               | 15 ++++++++-------
 mm/slab_common.c        |  4 ++--
 mm/slob.c               | 12 ++++++------
 mm/slub.c               |  8 ++++----
 mm/vmscan.c             |  3 ++-
 mm/workingset.c         |  6 ++++--
 13 files changed, 53 insertions(+), 42 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 9d6afb7d2ccd..b3d13fa715ad 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
-	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
-	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+	sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..0811e4100084 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -53,8 +53,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
-	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
-	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 22fe65edf425..1c68c482df6f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -171,8 +171,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,
-	NR_SLAB_UNRECLAIMABLE,
+	NR_SLAB_RECLAIMABLE_B,
+	NR_SLAB_UNRECLAIMABLE_B,
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
@@ -206,7 +206,17 @@ enum node_stat_item {
 
 static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
 {
-	return false;
+	/*
+	 * Global and per-node slab counters track slab pages.
+	 * It's expected that changes are multiples of PAGE_SIZE.
+	 * Internally values are stored in pages.
+	 *
+	 * Per-memcg and per-lruvec counters track memory, consumed
+	 * by individual slab objects. These counters are actually
+	 * byte-precise.
+	 */
+	return (item == NR_SLAB_RECLAIMABLE_B ||
+		item == NR_SLAB_UNRECLAIMABLE_B);
 }
 
 /*
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 659800157b17..22da1728b9cb 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1664,7 +1664,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
 {
 	unsigned long size;
 
-	size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+	size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
 		+ global_node_page_state(NR_ACTIVE_ANON)
 		+ global_node_page_state(NR_INACTIVE_ANON)
 		+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f700fa8b78c..6cbc1f4829fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1409,9 +1409,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
 		       1024);
 	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
-		       PAGE_SIZE);
+		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
@@ -1445,11 +1444,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 			       PAGE_SIZE);
 
 	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
 
 	/* Accumulated memory events */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 463b3d74a64a..eb0ccb8666b0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,7 +184,7 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
-	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b48336e20bdc..a4daae53b273 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5175,8 +5175,8 @@ long si_mem_available(void)
 	 * items that are in use, and cannot be freed. Cap this estimate at the
 	 * low watermark.
 	 */
-	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
-			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
@@ -5320,8 +5320,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
-		global_node_page_state(NR_SLAB_RECLAIMABLE),
-		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index 815e4e9a94cd..633eedb6bad1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 static inline int cache_vmstat_idx(struct kmem_cache *s)
 {
 	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+		NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -361,7 +361,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages);
+				    nr_pages << PAGE_SHIFT);
 		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
 		return 0;
 	}
@@ -371,7 +371,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
@@ -396,11 +396,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+				 -(nr_pages << PAGE_SHIFT));
 		memcg_kmem_uncharge(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -nr_pages);
+				    -(nr_pages << PAGE_SHIFT));
 	}
 	rcu_read_unlock();
 
@@ -484,7 +485,7 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    1 << order);
+				    PAGE_SIZE << order);
 		return 0;
 	}
 
@@ -496,7 +497,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 		return;
 	}
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..b578ae29c743 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1325,8 +1325,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	page = alloc_pages(flags, order);
 	if (likely(page)) {
 		ret = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 	ret = kasan_kmalloc_large(ret, size, flags);
 	/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index ac2aecfbc7a8..7cc9805c8091 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
-	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-			    1 << order);
+	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+			    PAGE_SIZE << order);
 	return page_address(page);
 }
 
@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
 
-	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-			    -(1 << order));
+	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
 }
 
@@ -552,8 +552,8 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else {
 		unsigned int order = compound_order(sp);
-		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(sp, order);
 
 	}
diff --git a/mm/slub.c b/mm/slub.c
index 914b7261e6b6..03071ae5ff07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3898,8 +3898,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	page = alloc_pages_node(node, flags, order);
 	if (page) {
 		ptr = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 
 	return kmalloc_large_node_hook(ptr, size, flags);
@@ -4030,8 +4030,8 @@ void kfree(const void *x)
 
 		BUG_ON(!PageCompound(page));
 		kfree_hook(object);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(page, order);
 		return;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4c3a760c0522..88aa6656aaca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4226,7 +4226,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * unmapped file backed pages.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+	    pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..9358c1ee5bb6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -467,8 +467,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.25.3




* [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-04-22 23:52   ` Christopher Lameter
  2020-05-20 13:51   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin, Christoph Lameter

This commit implements the SLUB version of the obj_to_index() function,
which will be required to calculate the offset of an obj_cgroup in the
obj_cgroups vector to store/obtain the objcg ownership data.

To make it faster, let's repeat the SLAB's trick introduced by
commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
divide in obj_to_index()") and avoid an expensive division.
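
The multiply-instead-of-divide trick can be tried out in userspace. The
sketch below is a standalone model, not the kernel code: fls32() and
obj_index() are names made up for this illustration, and the arithmetic
is assumed to mirror the kernel's reciprocal_value() and
reciprocal_divide() helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the kernel's struct reciprocal_value. */
struct reciprocal_value {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* 1-based position of the highest set bit; 0 for x == 0 (hypothetical helper). */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Precompute the multiplier and shifts once for a fixed divisor d. */
static struct reciprocal_value reciprocal_value(uint32_t d)
{
	struct reciprocal_value R;
	uint64_t m;
	int l;

	assert(d != 0);
	l = fls32(d - 1);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	R.m = (uint32_t)m;
	R.sh1 = l < 1 ? l : 1;
	R.sh2 = l > 1 ? l - 1 : 0;
	return R;
}

/* Compute a / d with one multiply and two shifts instead of a divide. */
static uint32_t reciprocal_divide(uint32_t a, struct reciprocal_value R)
{
	uint32_t t = (uint32_t)(((uint64_t)a * R.m) >> 32);

	return (t + ((a - t) >> R.sh1)) >> R.sh2;
}

/* Index of the object at byte offset `offset` within a slab (hypothetical). */
static unsigned int obj_index(uint32_t offset, struct reciprocal_value rsize)
{
	return reciprocal_divide(offset, rsize);
}
```

The division happens once per cache, when the object size is set; every
later lookup pays only for a multiplication and two shifts.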

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/slub_def.h | 9 +++++++++
 mm/slub.c                | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..200ea292f250 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,11 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - page_address(page),
+				 cache->reciprocal_size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index 03071ae5ff07..8d16babe1829 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3660,6 +3660,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-- 
2.25.3




* [PATCH v3 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-04-22 20:46 ` [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

From: Johannes Weiner <hannes@cmpxchg.org>

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's
new slab controller, which maintains pools of objects and doesn't want
to keep an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track
pointers to an object) is only for historical reasons: memcg used to
not take any css references and simply stalled offlining until all
charges had been reparented and the page counters had dropped to
zero. When we got rid of the reparenting requirement, the simple
mechanical translation was to take a reference for every charge.

More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"),
commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
reclaim from offlined groups").

The new slab controller exposes the limitations in this scheme, so
let's switch it to a more idiomatic reference counting model based on
actual kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg it's caching

- User pages hold a reference for their page->mem_cgroup. Transparent
  huge pages will no longer acquire tail references in advance, we'll
  get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- mem_cgroup_try_charge(), if successful, will return one reference to
  be consumed by page->mem_cgroup during commit, or put during cancel

- Pages allocated in the root cgroup will acquire and release css
  references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
  references; this change gets rid of that as well.
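
The first rule above can be pictured with a small userspace model. This
is only a sketch: struct group and struct stock are hypothetical
stand-ins for the memcg and the per-cpu stock, and plain integers replace
the real css reference counting and page counters. The point it
illustrates is that the stock takes a single reference when it starts
caching a group, no matter how many pages it holds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: a refcount plus a page counter. */
struct group {
	long refs;		/* one per pointer to this group */
	unsigned long charged;	/* pages charged, independent of refs */
};

/* Hypothetical stand-in for the per-cpu stock. */
struct stock {
	struct group *cached;
	unsigned int nr_pages;
};

static void group_get(struct group *g) { g->refs++; }
static void group_put(struct group *g) { g->refs--; }

/* Return cached pages to the group and drop the stock's single reference. */
static void drain_stock(struct stock *s)
{
	struct group *old = s->cached;

	if (!old)
		return;
	old->charged -= s->nr_pages;
	s->nr_pages = 0;
	group_put(old);
	s->cached = NULL;
}

/* Cache pre-charged pages; take one reference only when switching groups. */
static void refill_stock(struct stock *s, struct group *g,
			 unsigned int nr_pages)
{
	if (s->cached != g) {
		drain_stock(s);
		group_get(g);
		s->cached = g;
	}
	s->nr_pages += nr_pages;
}
```

Refilling the same group twice leaves the reference count unchanged after
the first call; only the cached page count grows.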

Roman: I've reformatted commit references in the commit log to make
  checkpatch.pl happy.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 45 ++++++++++++++++++++++++++-------------------
 mm/slab.h       |  2 --
 2 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6cbc1f4829fc..83805b48817d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2111,13 +2111,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
+	if (!old)
+		return;
+
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
-		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
+
+	css_put(&old->css);
 	stock->cached = NULL;
 }
 
@@ -2153,6 +2157,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
+		css_get(&memcg->css);
 		stock->cached = memcg;
 	}
 	stock->nr_pages += nr_pages;
@@ -2583,12 +2588,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2625,8 +2628,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
@@ -2977,6 +2978,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+			return 0;
 		}
 	}
 	css_put(&memcg->css);
@@ -2999,12 +3001,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
+	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -3016,15 +3017,18 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = head->mem_cgroup;
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		css_get(&memcg->css);
+		head[i].mem_cgroup = memcg;
+	}
 
-	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
+	__mod_memcg_state(memcg, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -5443,7 +5447,9 @@ static int mem_cgroup_move_account(struct page *page,
 	 * uncharging, charging, migration, or LRU putback.
 	 */
 
-	/* caller should have done css_get */
+	css_get(&to->css);
+	css_put(&from->css);
+
 	page->mem_cgroup = to;
 
 	spin_unlock_irqrestore(&from->move_lock, flags);
@@ -6537,8 +6543,10 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		memcg = get_mem_cgroup_from_mm(mm);
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
-
-	css_put(&memcg->css);
+	if (ret) {
+		css_put(&memcg->css);
+		memcg = NULL;
+	}
 out:
 	*memcgp = memcg;
 	return ret;
@@ -6634,6 +6642,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		return;
 
 	cancel_charge(memcg, nr_pages);
+
+	css_put(&memcg->css);
 }
 
 struct uncharge_gather {
@@ -6675,9 +6685,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
-
-	if (!mem_cgroup_is_root(ug->memcg))
-		css_put_many(&ug->memcg->css, nr_pages);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6725,6 +6732,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->dummy_page = page;
 	page->mem_cgroup = NULL;
+	css_put(&ug->memcg->css);
 }
 
 static void uncharge_list(struct list_head *page_list)
@@ -6831,8 +6839,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
+	css_get(&memcg->css);
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
@@ -7071,8 +7079,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 				     -nr_entries);
 	memcg_check_events(memcg, page);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_entries);
+	css_put(&memcg->css);
 }
 
 /**
diff --git a/mm/slab.h b/mm/slab.h
index 633eedb6bad1..8a574d9361c1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -373,9 +373,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
-	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-	css_put_many(&memcg->css, nr_pages);
 out:
 	css_put(&memcg->css);
 	return ret;
-- 
2.25.3




* [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (4 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-07 21:03   ` Johannes Weiner
  2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

The obj_cgroup API provides the ability to account sub-page sized kernel
objects, which potentially outlive the original memory cgroup.

The top-level API consists of the following functions:
  bool obj_cgroup_tryget(struct obj_cgroup *objcg);
  void obj_cgroup_get(struct obj_cgroup *objcg);
  void obj_cgroup_put(struct obj_cgroup *objcg);

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

  struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
  struct obj_cgroup *get_obj_cgroup_from_current(void);

An object cgroup is basically a pointer to a memory cgroup with a
per-cpu reference counter. It substitutes for a memory cgroup in places
where it's necessary to charge an arbitrary number of bytes instead of
pages.

Memory is charged to the corresponding memory cgroup in whole pages
using __memcg_kmem_charge(): the requested size is rounded up to pages,
and the sub-page remainder is kept in the per-cpu stock.

It implements reparenting: on memcg offlining, an object cgroup is
reattached to the parent memory cgroup. Each online memory cgroup has
an associated active object cgroup to handle new allocations, and a
list of all attached object cgroups. When a cgroup is offlined, this
list is reparented, and the memcg pointer of each object cgroup in the
list is swapped to the parent memory cgroup. This prevents long-lived
objects from pinning the original memory cgroup in memory.
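
The reparenting step can be modelled in a few lines of userspace C. This
is a simplified sketch: struct group and struct objcg are hypothetical
stand-ins, a singly linked list replaces the kernel's list_head, and all
locking and reference counting is left out.

```c
#include <assert.h>
#include <stddef.h>

struct group;

/* Hypothetical model of an object cgroup: a pointer to its current owner. */
struct objcg {
	struct group *owner;
	struct objcg *next;
};

/* Hypothetical model of a memory cgroup with its attached objcgs. */
struct group {
	struct group *parent;
	struct objcg *objcgs;
};

/* On offlining, re-point every attached objcg at the parent and move it. */
static void reparent_objcgs(struct group *g)
{
	struct objcg *o;

	while ((o = g->objcgs) != NULL) {
		g->objcgs = o->next;
		o->owner = g->parent;
		o->next = g->parent->objcgs;
		g->parent->objcgs = o;
	}
}
```

Objects that still point at an objcg never need to be visited: they keep
their objcg pointer, and only the objcg's owner changes.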

The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is part of the
obj_cgroup object. So on cgroup offlining the leftover is automatically
reparented.
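
The page/byte arithmetic behind the stocks can be checked in isolation.
The sketch below models only the split performed when a charge misses the
per-cpu stock. split_charge() and struct charge_split are names invented
for this example, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Result of splitting a byte-sized charge (hypothetical type). */
struct charge_split {
	unsigned long nr_pages;	/* whole pages charged to the memcg */
	unsigned long leftover;	/* bytes put back into the per-cpu stock */
};

/* Round the request up to pages and report the unused remainder. */
static struct charge_split split_charge(size_t size)
{
	struct charge_split cs;
	unsigned long nr_bytes = size & (MODEL_PAGE_SIZE - 1);

	cs.nr_pages = size >> MODEL_PAGE_SHIFT;
	cs.leftover = 0;
	if (nr_bytes) {
		cs.nr_pages += 1;
		cs.leftover = MODEL_PAGE_SIZE - nr_bytes;
	}
	return cs;
}
```

For a 200-byte object this charges one page and leaves 3896 bytes in the
stock, which later small allocations from the same object cgroup can
consume without touching the page counters.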

memcg->objcg is rcu protected.
objcg->memcg is a raw pointer, which is always pointing at a memory
cgroup, but can be atomically swapped to the parent memory cgroup. So
the caller must ensure the lifetime of the cgroup, e.g. grab
rcu_read_lock or css_set_lock.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  51 ++++++++
 mm/memcontrol.c            | 248 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 298 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c2eb73d89f5d..bf1be842fd27 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
+ */
+struct obj_cgroup {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	atomic_t nr_charged_bytes;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -306,6 +323,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct obj_cgroup __rcu *objcg;
+	struct list_head objcg_list;
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -429,6 +448,33 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
+{
+	return percpu_ref_tryget(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	percpu_ref_get(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+	percpu_ref_put(&objcg->refcnt);
+}
+
+/*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1390,6 +1436,11 @@ void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_current(void);
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 83805b48817d..7f87a0eeafec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+	struct mem_cgroup *memcg;
+	unsigned int nr_bytes;
+	unsigned int nr_pages;
+	unsigned long flags;
+
+	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
+	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	memcg = obj_cgroup_memcg(objcg);
+	if (nr_pages)
+		__memcg_kmem_uncharge(memcg, nr_pages);
+	list_del(&objcg->list);
+	mem_cgroup_put(memcg);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	percpu_ref_exit(ref);
+	kfree_rcu(objcg, rcu);
+}
+
+static struct obj_cgroup *obj_cgroup_alloc(void)
+{
+	struct obj_cgroup *objcg;
+	int ret;
+
+	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
+		return NULL;
+
+	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
+			      GFP_KERNEL);
+	if (ret) {
+		kfree(objcg);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&objcg->list);
+	return objcg;
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct obj_cgroup *objcg, *iter;
+
+	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+
+	spin_lock_irq(&css_set_lock);
+
+	/* Move active objcg to the parent's list */
+	xchg(&objcg->memcg, parent);
+	css_get(&parent->css);
+	list_add(&objcg->list, &parent->objcg_list);
+
+	/* Move already reparented objcgs to the parent's list */
+	list_for_each_entry(iter, &memcg->objcg_list, list) {
+		css_get(&parent->css);
+		xchg(&iter->memcg, parent);
+		css_put(&memcg->css);
+	}
+	list_splice(&memcg->objcg_list, &parent->objcg_list);
+
+	spin_unlock_irq(&css_set_lock);
+
+	percpu_ref_kill(&objcg->refcnt);
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -2064,6 +2136,12 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	unsigned int nr_bytes;
+#endif
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2071,6 +2149,22 @@ struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg);
+
+#else
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+}
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	return false;
+}
+#endif
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2137,6 +2231,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2196,6 +2291,8 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		if (obj_stock_flush_required(stock, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	return page->mem_cgroup;
 }
 
+__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (unlikely(!current->mm))
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(current->active_memcg))
+		memcg = rcu_dereference(current->active_memcg);
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -3007,6 +3128,120 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 }
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->cached_objcg;
+
+	if (!old)
+		return;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		if (nr_pages) {
+			rcu_read_lock();
+			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			rcu_read_unlock();
+		}
+
+		atomic_add(nr_bytes, &old->nr_charged_bytes);
+		stock->nr_bytes = 0;
+	}
+
+	obj_cgroup_put(old);
+	stock->cached_objcg = NULL;
+}
+
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+
+	return false;
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached_objcg != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		obj_cgroup_get(objcg);
+		stock->cached_objcg = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > PAGE_SIZE)
+		drain_obj_stock(stock);
+
+	local_irq_restore(flags);
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	unsigned int nr_pages, nr_bytes;
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	css_get(&memcg->css);
+	rcu_read_unlock();
+
+	nr_pages = size >> PAGE_SHIFT;
+	nr_bytes = size & (PAGE_SIZE - 1);
+
+	if (nr_bytes)
+		nr_pages += 1;
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	if (!ret && nr_bytes)
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+
+	css_put(&memcg->css);
+	return ret;
+}
+
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
+{
+	refill_obj_stock(objcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -3429,6 +3664,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
+	struct obj_cgroup *objcg;
 	int memcg_id;
 
 	if (cgroup_memory_nokmem)
@@ -3441,6 +3677,14 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg) {
+		memcg_free_cache_id(memcg_id);
+		return -ENOMEM;
+	}
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3476,9 +3720,10 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches.
+	 * Deactivate and reparent kmem_caches and objcgs.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5045,6 +5290,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->objcg_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.25.3




* [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (5 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-04-23 20:20   ` Roman Gushchin
                     ` (2 more replies)
  2020-04-22 20:46 ` [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
                   ` (11 subsequent siblings)
  18 siblings, 3 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Allocate and release memory to store obj_cgroup pointers for each
non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
to the allocated space.

To distinguish between obj_cgroups and memcg pointers in cases where
it's not obvious which one is used (as in page_cgroup_ino()), let's
always set the lowest bit in the obj_cgroup case.
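
The low-bit tagging can be shown with a standalone userspace sketch.
tag_vec(), is_tagged() and untag_vec() are hypothetical helper names, not
functions from this patch. The scheme relies on the allocator returning
pointers aligned to at least two bytes, so the lowest bit of a real
pointer is always zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mark a vector pointer by setting its lowest bit. */
static void *tag_vec(void *vec)
{
	return (void *)((uintptr_t)vec | (uintptr_t)0x1);
}

/* A set lowest bit means "this is a tagged vector, not a plain pointer". */
static int is_tagged(const void *p)
{
	return ((uintptr_t)p & (uintptr_t)0x1) != 0;
}

/* Clear the tag to recover the pointer that can be dereferenced or freed. */
static void *untag_vec(void *p)
{
	return (void *)((uintptr_t)p & ~(uintptr_t)0x1);
}
```

A reader that only needs to tell the two cases apart checks the bit; a
reader that needs the vector clears the bit first.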

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mm_types.h |  5 ++++-
 include/linux/slab_def.h |  5 +++++
 include/linux/slub_def.h |  2 ++
 mm/memcontrol.c          | 17 +++++++++++---
 mm/slab.c                |  3 ++-
 mm/slab.h                | 48 ++++++++++++++++++++++++++++++++++++++++
 mm/slub.c                |  5 +++++
 7 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4aba6c0c2ba8..0ad7e700f26d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct obj_cgroup **obj_cgroups;
+	};
 #endif
 
 	/*
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index abc7de77b988..967a9a525eab 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -114,4 +114,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 	return reciprocal_divide(offset, cache->reciprocal_buffer_size);
 }
 
+static inline int objs_per_slab(const struct kmem_cache *cache)
+{
+	return cache->num;
+}
+
 #endif	/* _LINUX_SLAB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 200ea292f250..cbda7d55796a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -191,4 +191,6 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 				 cache->reciprocal_size);
 }
 
+extern int objs_per_slab(struct kmem_cache *cache);
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f87a0eeafec..63826e460b3f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -549,10 +549,21 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page))
+	if (PageSlab(page) && !PageTail(page)) {
 		memcg = memcg_from_slab_page(page);
-	else
-		memcg = READ_ONCE(page->mem_cgroup);
+	} else {
+		memcg = page->mem_cgroup;
+
+		/*
+		 * The lowest bit set means that memcg isn't a valid
+		 * memcg pointer, but a obj_cgroups pointer.
+		 * In this case the page is shared and doesn't belong
+		 * to any specific memory cgroup.
+		 */
+		if ((unsigned long) memcg & 0x1UL)
+			memcg = NULL;
+	}
+
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
diff --git a/mm/slab.c b/mm/slab.c
index 9350062ffc1a..f2d67984595b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,7 +1370,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+			     cachep->num)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 8a574d9361c1..44def57f050e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	/*
+	 * page->mem_cgroup and page->obj_cgroups are sharing the same
+	 * space. To distinguish between them in case we don't know for sure
+	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
+	 * always set the lowest bit of obj_cgroups.
+	 */
+	return (struct obj_cgroup **)
+		((unsigned long)page->obj_cgroups & ~0x1UL);
+}
+
 /*
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
@@ -406,6 +418,25 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					       unsigned int objects)
+{
+	void *vec;
+
+	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
+	if (!vec)
+		return -ENOMEM;
+
+	page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+	kfree(page_obj_cgroups(page));
+	page->obj_cgroups = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +486,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					       unsigned int objects)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -481,12 +522,18 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +546,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_obj_cgroups(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 8d16babe1829..68c2c45dfac1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5992,4 +5992,9 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 {
 	return -EIO;
 }
+
+int objs_per_slab(struct kmem_cache *cache)
+{
+	return oo_objects(cache->oo);
+}
 #endif /* CONFIG_SLUB_DEBUG */
-- 
2.25.3




* [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (6 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-25 15:07   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Store the obj_cgroup pointer in the corresponding place of
page->obj_cgroups for each allocated non-root slab object.
Make sure that each allocated object holds a reference to obj_cgroup.
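To illustrate the layout, here is a minimal userspace sketch, not the kernel code. It assumes a per-slab vector with one slot per object, indexed by the object's position in the slab, and the low-bit tag that page_obj_cgroups() strips in the previous patch. The names struct slab_page, slab_vec(), slab_set_objcg() and slab_clear_objcg() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct objcg {
	int refcnt;
};

struct slab_page {
	uintptr_t obj_cgroups;	/* tagged pointer: vector address | 0x1 */
	unsigned int objects;	/* number of objects in the slab */
	size_t obj_size;	/* size of one object */
	char *base;		/* address of the first object */
};

/* Allocate one slot per object and set the low tag bit on the pointer. */
static int slab_alloc_vec(struct slab_page *pg)
{
	void *vec = calloc(pg->objects, sizeof(struct objcg *));

	if (!vec)
		return -1;
	pg->obj_cgroups = (uintptr_t)vec | 0x1UL;
	return 0;
}

/* Strip the tag bit to recover the real vector address. */
static struct objcg **slab_vec(struct slab_page *pg)
{
	return (struct objcg **)(pg->obj_cgroups & ~(uintptr_t)0x1);
}

/* Map an object address to its slot index. */
static unsigned int obj_index(struct slab_page *pg, void *obj)
{
	size_t off = (size_t)((char *)obj - pg->base);

	assert(off % pg->obj_size == 0 && off / pg->obj_size < pg->objects);
	return (unsigned int)(off / pg->obj_size);
}

/* Record the owner of an object; the object holds one reference. */
static void slab_set_objcg(struct slab_page *pg, void *obj, struct objcg *cg)
{
	cg->refcnt++;
	slab_vec(pg)[obj_index(pg, obj)] = cg;
}

/* Clear the slot on free and drop the reference held by the object. */
static void slab_clear_objcg(struct slab_page *pg, void *obj)
{
	unsigned int idx = obj_index(pg, obj);
	struct objcg *cg = slab_vec(pg)[idx];

	slab_vec(pg)[idx] = NULL;
	cg->refcnt--;
}
```

The tag bit only matters to readers that cannot tell whether the word holds a memcg pointer or a vector pointer; every access to the vector goes through the untagging helper.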

The objcg pointer is obtained by dereferencing memcg->objcg in
memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
post_alloc hook. If the allocation(s) succeed, the pointer is then
stored in the page->obj_cgroups vector.
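The reference handoff between the two hooks can be sketched roughly as follows. This is a simplified userspace model under stated assumptions: a plain integer stands in for the real reference counter, the slot index equals the index in the object array (the kernel derives it with obj_to_index()), and pre_alloc(), post_alloc() and free_obj() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct obj_cgroup: only a plain refcount. */
struct objcg {
	int refcnt;
};

/* pre_alloc hook: take one temporary reference for the whole request. */
static struct objcg *pre_alloc(struct objcg *cg)
{
	cg->refcnt++;
	return cg;
}

/*
 * post_alloc hook: every successfully allocated object takes its own
 * reference and records the owner; the temporary reference taken in
 * pre_alloc() is dropped at the end.
 */
static void post_alloc(struct objcg *cg, struct objcg **slots,
		       void **objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i]) {
			cg->refcnt++;
			slots[i] = cg;
		}
	}
	cg->refcnt--;
}

/* free hook: clear the slot and drop the reference held by the object. */
static void free_obj(struct objcg **slots, size_t i)
{
	struct objcg *cg = slots[i];

	slots[i] = NULL;
	assert(cg->refcnt > 0);
	cg->refcnt--;
}
```

After a bulk request the reference count equals the number of live objects, regardless of how many slots in the request failed.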

The code that obtains the objcg looks a bit bulky now, but it will be
simplified by later commits in the series.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            | 14 +++++++--
 mm/slab.c                  | 18 +++++++-----
 mm/slab.h                  | 60 ++++++++++++++++++++++++++++++++++----
 mm/slub.c                  | 14 +++++----
 5 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf1be842fd27..44b7d1244620 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1426,7 +1426,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 63826e460b3f..deb6ceae7577 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2964,7 +2964,8 @@ static inline bool memcg_kmem_bypass(void)
  * done with it, memcg_kmem_put_cache() must be called to release the
  * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -3020,8 +3021,17 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 */
 	if (unlikely(!memcg_cachep))
 		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
+		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
+
+		if (!objcg || !obj_cgroup_tryget(objcg)) {
+			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
+			goto out_unlock;
+		}
+
+		*objcgp = objcg;
 		cachep = memcg_cachep;
+	}
 out_unlock:
 	rcu_read_unlock();
 	return cachep;
diff --git a/mm/slab.c b/mm/slab.c
index f2d67984595b..ad38fbae4042 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3223,9 +3223,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3261,7 +3262,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3302,9 +3303,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3318,7 +3320,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3440,6 +3442,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
+	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
@@ -3505,8 +3508,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3529,13 +3533,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index 44def57f050e..525e09e05743 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -437,6 +437,41 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+	struct page *page;
+	unsigned long off;
+	size_t i;
+
+	for (i = 0; i < size; i++) {
+		if (likely(p[i])) {
+			page = virt_to_head_page(p[i]);
+			off = obj_to_index(s, page, p[i]);
+			obj_cgroup_get(objcg);
+			page_obj_cgroups(page)[off] = objcg;
+		}
+	}
+	obj_cgroup_put(objcg);
+	memcg_kmem_put_cache(s);
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+	struct obj_cgroup *objcg;
+	unsigned int off;
+
+	if (!memcg_kmem_enabled() || is_root_cache(s))
+		return;
+
+	off = obj_to_index(s, page, p);
+	objcg = page_obj_cgroups(page)[off];
+	page_obj_cgroups(page)[off] = NULL;
+	obj_cgroup_put(objcg);
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -496,6 +531,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -604,7 +650,8 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     gfp_t flags)
+						     struct obj_cgroup **objcgp,
+						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
 
@@ -618,13 +665,14 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s);
+		return memcg_kmem_get_cache(s, objcgp);
 
 	return s;
 }
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-					size_t size, void **p)
+static inline void slab_post_alloc_hook(struct kmem_cache *s,
+					struct obj_cgroup *objcg,
+					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
 
@@ -636,8 +684,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_kmem_put_cache(s);
+	if (!is_root_cache(s))
+		memcg_slab_post_alloc_hook(s, objcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index 68c2c45dfac1..67ae40fcfcda 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2734,8 +2734,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
+	s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
 	if (!s)
 		return NULL;
 redo:
@@ -2811,7 +2812,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
-	slab_post_alloc_hook(s, gfpflags, 1, &object);
+	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object);
 
 	return object;
 }
@@ -3016,6 +3017,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
+
+	memcg_slab_free_hook(s, page, head);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3195,9 +3198,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
+	struct obj_cgroup *objcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3251,11 +3255,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.25.3




* [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (7 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-25 16:10   ` Vlastimil Babka
  2020-04-22 20:46 ` [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Switch to per-object accounting of non-root slab objects.

Charging is performed using the obj_cgroup API in the pre_alloc hook.
The obj_cgroup is charged with the size of the object plus the size
of its metadata; for now, that is the size of an obj_cgroup pointer.
If the memory has been charged successfully, the actual allocation
code is executed. Otherwise, -ENOMEM is returned.
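As a rough illustration of the charge arithmetic, here is a userspace sketch. It assumes a simple byte counter with a hard limit in place of the real obj_cgroup charging machinery; struct objcg, objcg_charge() and pre_alloc_charge() are hypothetical names, and only the "object size plus one pointer" formula is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in: a byte counter with a hard limit. */
struct objcg {
	size_t charged;
	size_t limit;
};

/* Each accounted object also pays for its slot in the ownership vector. */
static size_t obj_full_size(size_t object_size)
{
	return object_size + sizeof(struct objcg *);
}

/* Charge bytes against the limit; nothing is charged on failure. */
static int objcg_charge(struct objcg *cg, size_t bytes)
{
	assert(cg->charged <= cg->limit);
	if (bytes > cg->limit - cg->charged)
		return -ENOMEM;
	cg->charged += bytes;
	return 0;
}

/* pre_alloc step: charge the whole request before allocating anything. */
static int pre_alloc_charge(struct objcg *cg, size_t object_size,
			    size_t objects)
{
	return objcg_charge(cg, objects * obj_full_size(object_size));
}
```

Charging up front means an over-limit request fails before the allocator is touched, so there is nothing to roll back on that path.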

In the post_alloc hook, if the actual allocation succeeded, the
corresponding vmstats are bumped and the obj_cgroup pointer is saved.
Otherwise, the charge is canceled.

On the free path, the obj_cgroup pointer is obtained and used to
uncharge the size of the object being released.
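The post_alloc and free paths described above can be modelled in userspace like this. It is a sketch only: two plain counters stand in for the charge and for the per-cgroup slab vmstat, and post_alloc_account() and free_account() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: the charge and the slab statistic, in bytes. */
struct objcg {
	long charged;
	long slab_stat;
};

/*
 * post_alloc step: the whole request was charged up front, so each
 * successful object only bumps the statistic, and each failed slot
 * gives its share of the charge back.
 */
static void post_alloc_account(struct objcg *cg, void **objs, size_t n,
			       long full_size)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i])
			cg->slab_stat += full_size;
		else
			cg->charged -= full_size;
	}
}

/* free step: return both the charge and the statistic for one object. */
static void free_account(struct objcg *cg, long full_size)
{
	assert(cg->charged >= full_size && cg->slab_stat >= full_size);
	cg->charged -= full_size;
	cg->slab_stat -= full_size;
}
```

Once the post_alloc step has run, the charge and the statistic agree, and both track live objects only.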

Memcg and lruvec counters now represent only the memory used by
active slab objects and do not include the free space. The free
space is shared and doesn't belong to any specific cgroup.

Global per-node slab vmstats are still modified from the
(un)charge_slab_page() functions. The idea is to keep all slab pages
accounted as slab pages at the system level.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.h | 173 ++++++++++++++++++++++++------------------------------
 1 file changed, 77 insertions(+), 96 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 525e09e05743..0ecf14bec6a2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -352,72 +352,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-/*
- * Charge the slab page belonging to the non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline int memcg_charge_slab(struct page *page,
-					     gfp_t gfp, int order,
-					     struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-	int ret;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	while (memcg && !css_tryget_online(&memcg->css))
-		memcg = parent_mem_cgroup(memcg);
-	rcu_read_unlock();
-
-	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages << PAGE_SHIFT);
-		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-		return 0;
-	}
-
-	ret = memcg_kmem_charge(memcg, gfp, nr_pages);
-	if (ret)
-		goto out;
-
-	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
-
-	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-out:
-	css_put(&memcg->css);
-	return ret;
-}
-
-/*
- * Uncharge a slab page belonging to a non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline void memcg_uncharge_slab(struct page *page, int order,
-						struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	if (likely(!mem_cgroup_is_root(memcg))) {
-		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
-				 -(nr_pages << PAGE_SHIFT));
-		memcg_kmem_uncharge(memcg, nr_pages);
-	} else {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(nr_pages << PAGE_SHIFT));
-	}
-	rcu_read_unlock();
-
-	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
 					       unsigned int objects)
 {
@@ -437,6 +371,47 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline size_t obj_full_size(struct kmem_cache *s)
+{
+	/*
+	 * For each accounted object there is an extra space which is used
+	 * to store obj_cgroup membership. Charge it too.
+	 */
+	return s->size + sizeof(struct obj_cgroup *);
+}
+
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	struct kmem_cache *cachep;
+
+	cachep = memcg_kmem_get_cache(s, objcgp);
+	if (is_root_cache(cachep))
+		return s;
+
+	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
+		memcg_kmem_put_cache(cachep);
+		cachep = NULL;
+	}
+
+	return cachep;
+}
+
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   int idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -451,6 +426,10 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
+			mod_objcg_state(objcg, page_pgdat(page),
+					cache_vmstat_idx(s), obj_full_size(s));
+		} else {
+			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
 	}
 	obj_cgroup_put(objcg);
@@ -469,6 +448,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
+
+	obj_cgroup_uncharge(objcg, obj_full_size(s));
+	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+			-obj_full_size(s));
+
 	obj_cgroup_put(objcg);
 }
 
@@ -510,17 +494,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
-				    struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void memcg_uncharge_slab(struct page *page, int order,
-				       struct kmem_cache *s)
-{
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
 					       unsigned int objects)
 {
@@ -531,6 +504,13 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	return NULL;
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -568,32 +548,33 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-	int ret;
-
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    PAGE_SIZE << order);
-		return 0;
-	}
+#ifdef CONFIG_MEMCG_KMEM
+	if (!is_root_cache(s)) {
+		int ret;
 
-	ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
-	if (ret)
-		return ret;
+		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
+		if (ret)
+			return ret;
 
-	return memcg_charge_slab(page, gfp, order, s);
+		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	}
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    PAGE_SIZE << order);
+	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(PAGE_SIZE << order));
-		return;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!is_root_cache(s)) {
+		memcg_free_page_obj_cgroups(page);
+		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 	}
-
-	memcg_free_page_obj_cgroups(page);
-	memcg_uncharge_slab(page, order, s);
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    -(PAGE_SIZE << order));
 }
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
@@ -665,7 +646,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s, objcgp);
+		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
 
 	return s;
 }
-- 
2.25.3




* [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (8 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2020-04-22 20:46 ` Roman Gushchin
  2020-05-07 21:05   ` Johannes Weiner
  2020-04-22 20:47 ` [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Deprecate memory.kmem.slabinfo.

An empty file will be presented if the corresponding config options
are enabled.

The interface is implementation dependent, isn't present in cgroup v2,
and is generally useful only for core mm debugging purposes. In other
words, it doesn't provide any value for the vast majority of users.

A drgn-based replacement can be found in tools/cgroup/slabinfo.py.
It supports both cgroup v1 and v2, mimics the memory.kmem.slabinfo
output, and also makes it possible to get additional information
without recompiling the kernel.

If a drgn-based solution is too slow for a task, a bpf-based tracing
tool can be used, which can easily keep track of all slab allocations
belonging to a memory cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c  |  3 ---
 mm/slab_common.c | 31 ++++---------------------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index deb6ceae7577..f957b029a62f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5089,9 +5089,6 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	(defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = memcg_slab_start,
-		.seq_next = memcg_slab_next,
-		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b578ae29c743..3c89c2adc930 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1523,35 +1523,12 @@ void dump_unreclaimable_slab(void)
 }
 
 #if defined(CONFIG_MEMCG_KMEM)
-void *memcg_slab_start(struct seq_file *m, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	mutex_lock(&slab_mutex);
-	return seq_list_start(&memcg->kmem_caches, *pos);
-}
-
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	return seq_list_next(p, &memcg->kmem_caches, pos);
-}
-
-void memcg_slab_stop(struct seq_file *m, void *p)
-{
-	mutex_unlock(&slab_mutex);
-}
-
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache,
-					  memcg_params.kmem_caches_node);
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	if (p == memcg->kmem_caches.next)
-		print_slabinfo_header(m);
-	cache_show(s, m);
+	/*
+	 * Deprecated.
+	 * Please, take a look at tools/cgroup/slabinfo.py .
+	 */
 	return 0;
 }
 #endif
-- 
2.25.3




* [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (9 preceding siblings ...)
  2020-04-22 20:46 ` [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-25 17:03   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function is small
and is a good fit for a static inline function.

It will be used from the slab code.
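For readers unfamiliar with the check, here is a userspace sketch of the predicate being moved. struct task, TASK_KTHREAD and kmem_bypass() are hypothetical stand-ins for current, PF_KTHREAD and memcg_kmem_bypass(); the three conditions mirror the ones in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TASK_KTHREAD 0x1u	/* hypothetical stand-in for PF_KTHREAD */

/* Hypothetical stand-in for the bits of task state the check reads. */
struct task {
	bool in_interrupt;	/* models in_interrupt() */
	void *mm;		/* NULL for tasks without a user address space */
	unsigned int flags;
};

/*
 * Accounting is bypassed when there is no user task to attribute the
 * allocation to: interrupt context, no mm, or a kernel thread.
 */
static bool kmem_bypass(const struct task *t)
{
	assert(t != NULL);
	return t->in_interrupt || !t->mm || (t->flags & TASK_KTHREAD);
}
```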

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 7 +++++++
 mm/memcontrol.c            | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44b7d1244620..840eb8d486a8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1462,6 +1462,13 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+	return false;
+}
+
 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
 					 int order)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f957b029a62f..06a5929f4872 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2941,13 +2941,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	queue_work(memcg_kmem_cache_wq, &cw->work);
 }
 
-static inline bool memcg_kmem_bypass(void)
-{
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
-		return true;
-	return false;
-}
-
 /**
  * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
-- 
2.25.3




* [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (10 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 10:12   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

This is a fairly big but mostly red patch, which makes all accounted
slab allocations use a single set of kmem_caches instead of
creating a separate set for each memory cgroup.

Because the number of non-root kmem_caches is now capped by the number
of root kmem_caches, there is no need to shrink or destroy them
prematurely. They can simply be destroyed together with their
root counterparts. This allows us to dramatically simplify the
management of non-root kmem_caches and delete a ton of code.

This patch performs the following changes:
1) introduces memcg_params.memcg_cache pointer to represent the
   kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
   to create memcg kmem_cache on the first allocation attempt
3) memcg kmem_caches are named <kmemcache_name>-memcg,
   e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
   or schedule its creation and return the root cache
5) removes almost all non-root kmem_cache management code
   (separate refcounter, reparenting, shrinking, etc)
6) makes slab debugfs display the root_mem_cgroup css id and never
   show :dead and :deact flags in the memcg_slabinfo attribute.
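Items 1) to 4) can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: creation is shown as a direct call rather than deferred work, and struct cache, get_cache() and create_memcg_cache() are hypothetical names. Only the "-memcg" suffix and the fallback to the root cache come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a kmem_cache with its memcg counterpart. */
struct cache {
	char name[64];
	struct cache *memcg_cache;	/* NULL until the shared cache exists */
	bool create_scheduled;
};

/*
 * Return the shared memcg cache if it exists; otherwise request its
 * creation and fall back to the root cache for this allocation.
 */
static struct cache *get_cache(struct cache *root)
{
	assert(root != NULL);
	if (root->memcg_cache)
		return root->memcg_cache;
	root->create_scheduled = true;
	return root;
}

/* Create the single memcg counterpart, named "<root name>-memcg". */
static void create_memcg_cache(struct cache *root, struct cache *out)
{
	snprintf(out->name, sizeof(out->name), "%s-memcg", root->name);
	out->memcg_cache = NULL;
	out->create_scheduled = false;
	root->memcg_cache = out;
}
```

Since there is at most one memcg cache per root cache, the lookup is a single pointer check instead of an index into a per-cgroup array.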

The following patches in the series will simplify kmem_cache creation.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   5 +-
 include/linux/slab.h       |   5 +-
 mm/memcontrol.c            | 163 +++-----------
 mm/slab.c                  |  16 +-
 mm/slab.h                  | 145 ++++---------
 mm/slab_common.c           | 426 ++++---------------------------------
 mm/slub.c                  |  38 +---
 7 files changed, 128 insertions(+), 670 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 840eb8d486a8..698b92d60da5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -322,7 +322,6 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct list_head kmem_caches;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list;
 #endif
@@ -1426,9 +1425,7 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp);
-void memcg_kmem_put_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6d454886bcaf..310768bfa8d2 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
+void memcg_create_kmem_cache(struct kmem_cache *cachep);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -578,8 +577,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 	return __kmalloc_node(size, flags, node);
 }
 
-int memcg_update_all_caches(int num_memcgs);
-
 /**
  * kmalloc_array - allocate memory for an array.
  * @n: number of elements.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 06a5929f4872..9fe2433fbe67 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -330,7 +330,7 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 }
 
 /*
- * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
+ * This will be used as a shrinker list's index.
  * The main reason for not using cgroup id for this:
  *  this works better in sparse environments, where we have a lot of memcgs,
  *  but only a few kmem-limited. Or also, if we have, for instance, 200
@@ -549,20 +549,16 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page)) {
-		memcg = memcg_from_slab_page(page);
-	} else {
-		memcg = page->mem_cgroup;
+	memcg = page->mem_cgroup;
 
-		/*
-		 * The lowest bit set means that memcg isn't a valid
-		 * memcg pointer, but a obj_cgroups pointer.
-		 * In this case the page is shared and doesn't belong
-		 * to any specific memory cgroup.
-		 */
-		if ((unsigned long) memcg & 0x1UL)
-			memcg = NULL;
-	}
+	/*
+	 * The lowest bit set means that memcg isn't a valid
+	 * memcg pointer, but an obj_cgroups pointer.
+	 * In this case the page is shared and doesn't belong
+	 * to any specific memory cgroup.
+	 */
+	if ((unsigned long) memcg & 0x1UL)
+		memcg = NULL;
 
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
@@ -2820,12 +2816,18 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	page = virt_to_head_page(p);
 
 	/*
-	 * Slab pages don't have page->mem_cgroup set because corresponding
-	 * kmem caches can be reparented during the lifetime. That's why
-	 * memcg_from_slab_page() should be used instead.
+	 * Slab objects are accounted individually, not per-page.
+	 * Memcg membership data for each individual object is saved in
+	 * page->obj_cgroups.
 	 */
-	if (PageSlab(page))
-		return memcg_from_slab_page(page);
+	if (page_has_obj_cgroups(page)) {
+		struct obj_cgroup *objcg;
+		unsigned int off;
+
+		off = obj_to_index(page->slab_cache, page, p);
+		objcg = page_obj_cgroups(page)[off];
+		return obj_cgroup_memcg(objcg);
+	}
 
 	/* All other pages use page->mem_cgroup */
 	return page->mem_cgroup;
@@ -2880,9 +2882,7 @@ static int memcg_alloc_cache_id(void)
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
-	err = memcg_update_all_caches(size);
-	if (!err)
-		err = memcg_update_all_list_lrus(size);
+	err = memcg_update_all_list_lrus(size);
 	if (!err)
 		memcg_nr_cache_ids = size;
 
@@ -2901,7 +2901,6 @@ static void memcg_free_cache_id(int id)
 }
 
 struct memcg_kmem_cache_create_work {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *cachep;
 	struct work_struct work;
 };
@@ -2910,31 +2909,24 @@ static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
 		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct mem_cgroup *memcg = cw->memcg;
 	struct kmem_cache *cachep = cw->cachep;
 
-	memcg_create_kmem_cache(memcg, cachep);
+	memcg_create_kmem_cache(cachep);
 
-	css_put(&memcg->css);
 	kfree(cw);
 }
 
 /*
  * Enqueue the creation of a per-memcg kmem_cache.
  */
-static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
-					       struct kmem_cache *cachep)
+static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
 {
 	struct memcg_kmem_cache_create_work *cw;
 
-	if (!css_tryget_online(&memcg->css))
-		return;
-
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
@@ -2942,102 +2934,26 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 }
 
 /**
- * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
  *
  * Return the kmem_cache we're supposed to use for a slab allocation.
- * We try to use the current memcg's version of the cache.
  *
  * If the cache does not exist yet, if we are the first user of it, we
  * create it asynchronously in a workqueue and let the current allocation
  * go through with the original cache.
- *
- * This function takes a reference to the cache it returns to assure it
- * won't get destroyed while we are working with it. Once the caller is
- * done with it, memcg_kmem_put_cache() must be called to release the
- * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
-	struct memcg_cache_array *arr;
-	int kmemcg_id;
-
-	VM_BUG_ON(!is_root_cache(cachep));
 
-	if (memcg_kmem_bypass())
+	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
+	if (unlikely(!memcg_cachep)) {
+		memcg_schedule_kmem_cache_create(cachep);
 		return cachep;
-
-	rcu_read_lock();
-
-	if (unlikely(current->active_memcg))
-		memcg = current->active_memcg;
-	else
-		memcg = mem_cgroup_from_task(current);
-
-	if (!memcg || memcg == root_mem_cgroup)
-		goto out_unlock;
-
-	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
-	if (kmemcg_id < 0)
-		goto out_unlock;
-
-	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match the data dependency
-	 * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
-	 */
-	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
-
-	/*
-	 * If we are in a safe context (can wait, and not in interrupt
-	 * context), we could be be predictable and return right away.
-	 * This would guarantee that the allocation being performed
-	 * already belongs in the new cache.
-	 *
-	 * However, there are some clashes that can arrive from locking.
-	 * For instance, because we acquire the slab_mutex while doing
-	 * memcg_create_kmem_cache, this means no further allocation
-	 * could happen with the slab_mutex held. So it's better to
-	 * defer everything.
-	 *
-	 * If the memcg is dying or memcg_cache is about to be released,
-	 * don't bother creating new kmem_caches. Because memcg_cachep
-	 * is ZEROed as the fist step of kmem offlining, we don't need
-	 * percpu_ref_tryget_live() here. css_tryget_online() check in
-	 * memcg_schedule_kmem_cache_create() will prevent us from
-	 * creation of a new kmem_cache.
-	 */
-	if (unlikely(!memcg_cachep))
-		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
-		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
-
-		if (!objcg || !obj_cgroup_tryget(objcg)) {
-			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
-			goto out_unlock;
-		}
-
-		*objcgp = objcg;
-		cachep = memcg_cachep;
 	}
-out_unlock:
-	rcu_read_unlock();
-	return cachep;
-}
 
-/**
- * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
- * @cachep: the cache returned by memcg_kmem_get_cache
- */
-void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (!is_root_cache(cachep))
-		percpu_ref_put(&cachep->memcg_params.refcnt);
+	return memcg_cachep;
 }
 
 /**
@@ -3708,7 +3624,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
-	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -3721,22 +3636,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	if (memcg->kmem_state != KMEM_ONLINE)
 		return;
-	/*
-	 * Clear the online state before clearing memcg_caches array
-	 * entries. The slab_mutex in memcg_deactivate_kmem_caches()
-	 * guarantees that no cache will be created for this cgroup
-	 * after we are done (see memcg_create_kmem_cache()).
-	 */
+
 	memcg->kmem_state = KMEM_ALLOCATED;
 
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
-	/*
-	 * Deactivate and reparent kmem_caches and objcgs.
-	 */
-	memcg_deactivate_kmem_caches(memcg, parent);
 	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -3771,10 +3677,8 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
 		memcg_offline_kmem(memcg);
 
-	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		WARN_ON(!list_empty(&memcg->kmem_caches));
+	if (memcg->kmem_state == KMEM_ALLOCATED)
 		static_branch_dec(&memcg_kmem_enabled_key);
-	}
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -5363,9 +5267,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
-#ifdef CONFIG_MEMCG_KMEM
-		INIT_LIST_HEAD(&memcg->kmem_caches);
-#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index ad38fbae4042..17f781a5b62c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache, NULL);
+	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
@@ -2244,17 +2244,6 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
-{
-	__kmem_cache_shrink(cachep);
-}
-
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-}
-#endif
-
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	return __kmem_cache_shrink(cachep);
@@ -3862,7 +3851,8 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 		return ret;
 
 	lockdep_assert_held(&slab_mutex);
-	for_each_memcg_cache(c, cachep) {
+	c = memcg_cache(cachep);
+	if (c) {
 		/* return value determined by the root cache only */
 		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 0ecf14bec6a2..28c582ec997a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,66 +32,25 @@ struct kmem_cache {
 
 #else /* !CONFIG_SLOB */
 
-struct memcg_cache_array {
-	struct rcu_head rcu;
-	struct kmem_cache *entries[];
-};
-
 /*
  * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child caches will have it. For the root cache,
- * this will hold a dynamically allocated array large enough to hold
- * information about the currently limited memcgs in the system. To allow the
- * array to be accessed without taking any locks, on relocation we free the old
- * version only after a grace period.
- *
- * Root and child caches hold different metadata.
+ * Both the root cache and the child cache will have it. Some fields are used
+ * in both cases, others are specific to root caches.
  *
  * @root_cache:	Common to root and child caches.  NULL for root, pointer to
  *		the root cache for children.
  *
  * The following fields are specific to root caches.
  *
- * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
- *		used to index child cachces during allocation and cleared
- *		early during shutdown.
- *
- * @root_caches_node: List node for slab_root_caches list.
- *
- * @children:	List of all child caches.  While the child caches are also
- *		reachable through @memcg_caches, a child cache remains on
- *		this list until it is actually destroyed.
- *
- * The following fields are specific to child caches.
- *
- * @memcg:	Pointer to the memcg this cache belongs to.
- *
- * @children_node: List node for @root_cache->children list.
- *
- * @kmem_caches_node: List node for @memcg->kmem_caches list.
+ * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
+ *		cgroups.
+ * @root_caches_node: list node for slab_root_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
-	union {
-		struct {
-			struct memcg_cache_array __rcu *memcg_caches;
-			struct list_head __root_caches_node;
-			struct list_head children;
-			bool dying;
-		};
-		struct {
-			struct mem_cgroup *memcg;
-			struct list_head children_node;
-			struct list_head kmem_caches_node;
-			struct percpu_ref refcnt;
-
-			void (*work_fn)(struct kmem_cache *);
-			union {
-				struct rcu_head rcu_head;
-				struct work_struct work;
-			};
-		};
-	};
+
+	struct kmem_cache *memcg_cache;
+	struct list_head __root_caches_node;
 };
 #endif /* CONFIG_SLOB */
 
@@ -234,8 +193,6 @@ bool __kmem_cache_empty(struct kmem_cache *);
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
-void __kmemcg_cache_deactivate(struct kmem_cache *s);
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 void kmem_cache_shrink_all(struct kmem_cache *s);
 
@@ -281,14 +238,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 extern struct list_head		slab_root_caches;
 #define root_caches_node	memcg_params.__root_caches_node
 
-/*
- * Iterate over all memcg caches of the given root cache. The caller must hold
- * slab_mutex.
- */
-#define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -319,6 +268,13 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return s->memcg_params.memcg_cache;
+	return NULL;
+}
+
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -331,25 +287,9 @@ static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 		((unsigned long)page->obj_cgroups & ~0x1UL);
 }
 
-/*
- * Expects a pointer to a slab page. Please note, that PageSlab() check
- * isn't sufficient, as it returns true also for tail compound slab pages,
- * which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageSlab() && !PageTail()
- * check.
- *
- * The kmem_cache can be reparented asynchronously. The caller must ensure
- * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
- */
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline bool page_has_obj_cgroups(struct page *page)
 {
-	struct kmem_cache *s;
-
-	s = READ_ONCE(page->slab_cache);
-	if (s && !is_root_cache(s))
-		return READ_ONCE(s->memcg_params.memcg);
-
-	return NULL;
+	return ((unsigned long)page->obj_cgroups & 0x1UL);
 }
 
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
@@ -385,16 +325,25 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 						size_t objects, gfp_t flags)
 {
 	struct kmem_cache *cachep;
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return s;
 
-	cachep = memcg_kmem_get_cache(s, objcgp);
+	cachep = memcg_kmem_get_cache(s);
 	if (is_root_cache(cachep))
 		return s;
 
-	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
-		memcg_kmem_put_cache(cachep);
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return s;
+
+	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
+		obj_cgroup_put(objcg);
 		cachep = NULL;
 	}
 
+	*objcgp = objcg;
 	return cachep;
 }
 
@@ -433,7 +382,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 		}
 	}
 	obj_cgroup_put(objcg);
-	memcg_kmem_put_cache(s);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
@@ -457,7 +405,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
 
@@ -465,9 +413,6 @@ extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 #define slab_root_caches	slab_caches
 #define root_caches_node	list
 
-#define for_each_memcg_cache(iter, root) \
-	for ((void)(iter), (void)(root); 0; )
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -489,7 +434,17 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	return NULL;
+}
+
+static inline bool page_has_obj_cgroups(struct page *page)
+{
+	return false;
+}
+
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
 }
@@ -526,8 +481,7 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s,
-				    struct mem_cgroup *memcg)
+static inline void memcg_link_cache(struct kmem_cache *s)
 {
 }
 
@@ -548,17 +502,14 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
 	if (!is_root_cache(s)) {
 		int ret;
 
 		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
 		if (ret)
 			return ret;
-
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
 	return 0;
@@ -567,12 +518,9 @@ static __always_inline int charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
-	if (!is_root_cache(s)) {
+	if (!is_root_cache(s))
 		memcg_free_page_obj_cgroups(page);
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -721,9 +669,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
-void *memcg_slab_start(struct seq_file *m, loff_t *pos);
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
-void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3c89c2adc930..e9deaafddbb6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -131,141 +131,36 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
-static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
-	INIT_LIST_HEAD(&s->memcg_params.children);
-	s->memcg_params.dying = false;
+	s->memcg_params.memcg_cache = NULL;
 }
 
-static int init_memcg_params(struct kmem_cache *s,
-			     struct kmem_cache *root_cache)
+static void init_memcg_params(struct kmem_cache *s,
+			      struct kmem_cache *root_cache)
 {
-	struct memcg_cache_array *arr;
-
-	if (root_cache) {
-		int ret = percpu_ref_init(&s->memcg_params.refcnt,
-					  kmemcg_cache_shutdown,
-					  0, GFP_KERNEL);
-		if (ret)
-			return ret;
-
+	if (root_cache)
 		s->memcg_params.root_cache = root_cache;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
-		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
-		return 0;
-	}
-
-	slab_init_memcg_params(s);
-
-	if (!memcg_nr_cache_ids)
-		return 0;
-
-	arr = kvzalloc(sizeof(struct memcg_cache_array) +
-		       memcg_nr_cache_ids * sizeof(void *),
-		       GFP_KERNEL);
-	if (!arr)
-		return -ENOMEM;
-
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
-	return 0;
-}
-
-static void destroy_memcg_params(struct kmem_cache *s)
-{
-	if (is_root_cache(s)) {
-		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-	} else {
-		mem_cgroup_put(s->memcg_params.memcg);
-		WRITE_ONCE(s->memcg_params.memcg, NULL);
-		percpu_ref_exit(&s->memcg_params.refcnt);
-	}
-}
-
-static void free_memcg_params(struct rcu_head *rcu)
-{
-	struct memcg_cache_array *old;
-
-	old = container_of(rcu, struct memcg_cache_array, rcu);
-	kvfree(old);
-}
-
-static int update_memcg_params(struct kmem_cache *s, int new_array_size)
-{
-	struct memcg_cache_array *old, *new;
-
-	new = kvzalloc(sizeof(struct memcg_cache_array) +
-		       new_array_size * sizeof(void *), GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	old = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	if (old)
-		memcpy(new->entries, old->entries,
-		       memcg_nr_cache_ids * sizeof(void *));
-
-	rcu_assign_pointer(s->memcg_params.memcg_caches, new);
-	if (old)
-		call_rcu(&old->rcu, free_memcg_params);
-	return 0;
-}
-
-int memcg_update_all_caches(int num_memcgs)
-{
-	struct kmem_cache *s;
-	int ret = 0;
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		ret = update_memcg_params(s, num_memcgs);
-		/*
-		 * Instead of freeing the memory, we'll just leave the caches
-		 * up to this point in an updated state.
-		 */
-		if (ret)
-			break;
-	}
-	mutex_unlock(&slab_mutex);
-	return ret;
+	else
+		slab_init_memcg_params(s);
 }
 
-void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
+void memcg_link_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_add(&s->root_caches_node, &slab_root_caches);
-	} else {
-		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
-		list_add(&s->memcg_params.children_node,
-			 &s->memcg_params.root_cache->memcg_params.children);
-		list_add(&s->memcg_params.kmem_caches_node,
-			 &s->memcg_params.memcg->kmem_caches);
-	}
 }
 
 static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_del(&s->root_caches_node);
-	} else {
-		list_del(&s->memcg_params.children_node);
-		list_del(&s->memcg_params.kmem_caches_node);
-	}
 }
 #else
-static inline int init_memcg_params(struct kmem_cache *s,
-				    struct kmem_cache *root_cache)
-{
-	return 0;
-}
-
-static inline void destroy_memcg_params(struct kmem_cache *s)
+static inline void init_memcg_params(struct kmem_cache *s,
+				     struct kmem_cache *root_cache)
 {
 }
 
@@ -380,7 +275,7 @@ static struct kmem_cache *create_cache(const char *name,
 		unsigned int object_size, unsigned int align,
 		slab_flags_t flags, unsigned int useroffset,
 		unsigned int usersize, void (*ctor)(void *),
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
@@ -400,24 +295,20 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, root_cache);
-	if (err)
-		goto out_free_cache;
-
+	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, memcg);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
 	return s;
 
 out_free_cache:
-	destroy_memcg_params(s);
 	kmem_cache_free(kmem_cache, s);
 	goto out;
 }
@@ -504,7 +395,7 @@ kmem_cache_create_usercopy(const char *name,
 
 	s = create_cache(cache_name, size,
 			 calculate_alignment(flags, align, size),
-			 flags, useroffset, usersize, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -629,51 +520,27 @@ static int shutdown_cache(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG_KMEM
 /*
- * memcg_create_kmem_cache - Create a cache for a memory cgroup.
- * @memcg: The memory cgroup the new cache is for.
+ * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
  * @root_cache: The parent of the new cache.
  *
  * This function attempts to create a kmem cache that will serve allocation
- * requests going from @memcg to @root_cache. The new cache inherits properties
- * from its parent.
+ * requests going from all non-root memory cgroups to @root_cache. The new
+ * cache inherits properties from its parent.
  */
-void memcg_create_kmem_cache(struct mem_cgroup *memcg,
-			     struct kmem_cache *root_cache)
+void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 {
-	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = &memcg->css;
-	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-	int idx;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	/*
-	 * The memory cgroup could have been offlined while the cache
-	 * creation work was pending.
-	 */
-	if (memcg->kmem_state != KMEM_ONLINE)
+	if (root_cache->memcg_params.memcg_cache)
 		goto out_unlock;
 
-	idx = memcg_cache_id(memcg);
-	arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-
-	/*
-	 * Since per-memcg caches are created asynchronously on first
-	 * allocation (see memcg_kmem_get_cache()), several threads can try to
-	 * create the same cache, but only one of them may succeed.
-	 */
-	if (arr->entries[idx])
-		goto out_unlock;
-
-	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
-			       css->serial_nr, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
 	if (!cache_name)
 		goto out_unlock;
 
@@ -681,7 +548,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			 root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
 			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, memcg, root_cache);
+			 root_cache->ctor, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
@@ -698,7 +565,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * initialized.
 	 */
 	smp_wmb();
-	arr->entries[idx] = s;
+	root_cache->memcg_params.memcg_cache = s;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
@@ -707,197 +574,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.work);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	s->memcg_params.work_fn(s);
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static void kmemcg_rcufn(struct rcu_head *head)
-{
-	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.rcu_head);
-
-	/*
-	 * We need to grab blocking locks.  Bounce to ->work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized earlier.
-	 */
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-}
-
-static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
-{
-	WARN_ON(shutdown_cache(s));
-}
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
-{
-	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
-					    memcg_params.refcnt);
-	unsigned long flags;
-
-	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-
-unlock:
-	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
-}
-
-static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	__kmemcg_cache_deactivate_after_rcu(s);
-	percpu_ref_kill(&s->memcg_params.refcnt);
-}
-
-static void kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	if (WARN_ON_ONCE(is_root_cache(s)))
-		return;
-
-	__kmemcg_cache_deactivate(s);
-	s->flags |= SLAB_DEACTIVATED;
-
-	/*
-	 * memcg_kmem_wq_lock is used to synchronize memcg_params.dying
-	 * flag and make sure that no new kmem_cache deactivation tasks
-	 * are queued (see flush_memcg_workqueue() ).
-	 */
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
-	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
-unlock:
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-}
-
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
-{
-	int idx;
-	struct memcg_cache_array *arr;
-	struct kmem_cache *s, *c;
-	unsigned int nr_reparented;
-
-	idx = memcg_cache_id(memcg);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-						lockdep_is_held(&slab_mutex));
-		c = arr->entries[idx];
-		if (!c)
-			continue;
-
-		kmemcg_cache_deactivate(c);
-		arr->entries[idx] = NULL;
-	}
-	nr_reparented = 0;
-	list_for_each_entry(s, &memcg->kmem_caches,
-			    memcg_params.kmem_caches_node) {
-		WRITE_ONCE(s->memcg_params.memcg, parent);
-		css_put(&memcg->css);
-		nr_reparented++;
-	}
-	if (nr_reparented) {
-		list_splice_init(&memcg->kmem_caches,
-				 &parent->kmem_caches);
-		css_get_many(&parent->css, nr_reparented);
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
-	struct memcg_cache_array *arr;
-	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
-	int i;
-
 	BUG_ON(!is_root_cache(s));
 
-	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
-	 */
-	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.children_node, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
-
-	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
-	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
-		shutdown_cache(c);
-
-	list_splice(&busy, &s->memcg_params.children);
+	if (s->memcg_params.memcg_cache)
+		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
 
-	/*
-	 * A cache being destroyed must be empty. In particular, this means
-	 * that all per memcg caches attached to it must be empty too.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		return -EBUSY;
 	return 0;
 }
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	s->memcg_params.dying = true;
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-
-	/*
-	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-	 * sure all registered rcu callbacks have been invoked.
-	 */
-	rcu_barrier();
-
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
@@ -905,18 +593,6 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
 	 */
 	if (likely(memcg_kmem_cache_wq))
 		flush_workqueue(memcg_kmem_cache_wq);
-
-	/*
-	 * If we're racing with children kmem_cache deactivation, it might
-	 * take another rcu grace period to complete their destruction.
-	 * At this moment the corresponding percpu_ref_kill() call should be
-	 * done, but it might take another rcu grace period to complete
-	 * switching to the atomic mode.
-	 * Please, note that we check without grabbing the slab_mutex. It's safe
-	 * because at this moment the children list can't grow.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		rcu_barrier();
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -932,7 +608,6 @@ static inline void flush_memcg_workqueue(struct kmem_cache *s)
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
-	destroy_memcg_params(s);
 	kfree_const(s->name);
 	kmem_cache_free(kmem_cache, s);
 }
@@ -996,7 +671,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 EXPORT_SYMBOL(kmem_cache_shrink);
 
 /**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * kmem_cache_shrink_all - shrink root and memcg caches
  * @s: The cache pointer
  */
 void kmem_cache_shrink_all(struct kmem_cache *s)
@@ -1013,21 +688,11 @@ void kmem_cache_shrink_all(struct kmem_cache *s)
 	kasan_cache_shrink(s);
 	__kmem_cache_shrink(s);
 
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
+	c = memcg_cache(s);
+	if (c) {
 		kasan_cache_shrink(c);
 		__kmem_cache_shrink(c);
 	}
-	mutex_unlock(&slab_mutex);
 	put_online_mems();
 	put_online_cpus();
 }
@@ -1082,7 +747,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1445,7 +1110,8 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	if (!is_root_cache(s))
 		return;
 
-	for_each_memcg_cache(c, s) {
+	c = memcg_cache(s);
+	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
 		get_slabinfo(c, &sinfo);
 
@@ -1576,7 +1242,7 @@ module_init(slab_proc_init);
 
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
 /*
- * Display information about kmem caches that have child memcg caches.
+ * Display information about kmem caches that have memcg cache.
  */
 static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 {
@@ -1588,9 +1254,9 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
 	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		/*
-		 * Skip kmem caches that don't have any memcg children.
+		 * Skip kmem caches that don't have the memcg cache.
 		 */
-		if (list_empty(&s->memcg_params.children))
+		if (!s->memcg_params.memcg_cache)
 			continue;
 
 		memset(&sinfo, 0, sizeof(sinfo));
@@ -1599,23 +1265,13 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
 			   sinfo.active_slabs, sinfo.num_slabs);
 
-		for_each_memcg_cache(c, s) {
-			struct cgroup_subsys_state *css;
-			char *status = "";
-
-			css = &c->memcg_params.memcg->css;
-			if (!(css->flags & CSS_ONLINE))
-				status = ":dead";
-			else if (c->flags & SLAB_DEACTIVATED)
-				status = ":deact";
-
-			memset(&sinfo, 0, sizeof(sinfo));
-			get_slabinfo(c, &sinfo);
-			seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n",
-				   cache_name(c), css->id, status,
-				   sinfo.active_objs, sinfo.num_objs,
-				   sinfo.active_slabs, sinfo.num_slabs);
-		}
+		c = s->memcg_params.memcg_cache;
+		memset(&sinfo, 0, sizeof(sinfo));
+		get_slabinfo(c, &sinfo);
+		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
+			   cache_name(c), root_mem_cgroup->css.id,
+			   sinfo.active_objs, sinfo.num_objs,
+			   sinfo.active_slabs, sinfo.num_slabs);
 	}
 	mutex_unlock(&slab_mutex);
 	return 0;
diff --git a/mm/slub.c b/mm/slub.c
index 67ae40fcfcda..3e4cb081af5d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4117,36 +4117,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	/*
-	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
-	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
-}
-
-void __kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	/*
-	 * Disable empty slabs caching. Used to avoid pinning offline
-	 * memory cgroups by kmem pages that can be freed.
-	 */
-	slub_set_cpu_partial(s, 0);
-	s->min_partial = 0;
-}
-#endif	/* CONFIG_MEMCG */
-
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4303,7 +4273,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	return s;
 }
 
@@ -4371,7 +4341,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache(c, s) {
+		c = memcg_cache(s);
+		if (c) {
 			c->object_size = s->object_size;
 			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
 		}
@@ -5626,7 +5597,8 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache(c, s)
+		c = memcg_cache(s);
+		if (c)
 			attribute->store(c, buf, len);
 		mutex_unlock(&slab_mutex);
 	}
-- 
2.25.3




* [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (11 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 10:31   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Because the number of non-root kmem_caches no longer depends on the
number of memory cgroups and is generally small, a dedicated
workqueue is no longer needed.

Also, since memcg_create_kmem_cache() no longer needs any argument
other than the root kmem_cache, the work structure can be embedded
into the kmem_cache itself, which avoids allocating it dynamically.

This also simplifies the synchronization: each root kmem_cache has
exactly one work item, so there are no more concurrent attempts to
create a non-root kmem_cache for the same root kmem_cache. The second
and all following attempts to queue the work fail while it is pending.
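
The embedding described above can be modelled in userspace. This is only
a sketch under stated assumptions: struct work_item, model_queue_work()
and model_run_work() are hypothetical stand-ins for the kernel's
work_struct, queue_work() and the worker, and struct cache stands in for
struct kmem_cache. It shows how container_of() recovers the owning cache
from the embedded item and why a second queueing attempt is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's definitions. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work_item {
	bool pending;				/* models the "pending" bit */
	void (*func)(struct work_item *work);
};

struct cache {
	const char *name;
	int create_calls;			/* times creation has run */
	struct work_item work;			/* embedded, never allocated */
};

/* Returns false if the item is already queued, like queue_work(). */
static bool model_queue_work(struct work_item *work)
{
	if (work->pending)
		return false;
	work->pending = true;
	return true;
}

/* Runs the item once if it is pending; clears pending before the call. */
static void model_run_work(struct work_item *work)
{
	assert(work->func != NULL);
	if (!work->pending)
		return;
	work->pending = false;
	work->func(work);
}

/* The only "argument" is the cache that embeds the work item. */
static void cache_create_func(struct work_item *work)
{
	struct cache *c = container_of(work, struct cache, work);

	c->create_calls++;
}

static void cache_init(struct cache *c, const char *name)
{
	c->name = name;
	c->create_calls = 0;
	c->work.pending = false;
	c->work.func = cache_create_func;
}
```

Because the item lives inside the cache, no allocation can fail on the
queueing path, and tearing the cache down only has to deal with that one
item.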

On the kmem_cache destruction path there is no more need to call the
expensive flush_workqueue() and wait for all pending work items to
finish. Instead, cancel_work_sync() can be used to cancel or wait for
just that one work item.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  1 -
 mm/memcontrol.c            | 48 +-------------------------------------
 mm/slab.h                  |  2 ++
 mm/slab_common.c           | 22 +++++++++--------
 4 files changed, 15 insertions(+), 58 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 698b92d60da5..87e6da5015b3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1440,7 +1440,6 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
 extern struct static_key_false memcg_kmem_enabled_key;
-extern struct workqueue_struct *memcg_kmem_cache_wq;
 
 extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9fe2433fbe67..55fd42155a37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -379,8 +379,6 @@ void memcg_put_cache_ids(void)
  */
 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
-
-struct workqueue_struct *memcg_kmem_cache_wq;
 #endif
 
 static int memcg_shrinker_map_size;
@@ -2900,39 +2898,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-struct memcg_kmem_cache_create_work {
-	struct kmem_cache *cachep;
-	struct work_struct work;
-};
-
-static void memcg_kmem_cache_create_func(struct work_struct *w)
-{
-	struct memcg_kmem_cache_create_work *cw =
-		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct kmem_cache *cachep = cw->cachep;
-
-	memcg_create_kmem_cache(cachep);
-
-	kfree(cw);
-}
-
-/*
- * Enqueue the creation of a per-memcg kmem_cache.
- */
-static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
-{
-	struct memcg_kmem_cache_create_work *cw;
-
-	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
-		return;
-
-	cw->cachep = cachep;
-	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
-
-	queue_work(memcg_kmem_cache_wq, &cw->work);
-}
-
 /**
  * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
@@ -2949,7 +2914,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 
 	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
 	if (unlikely(!memcg_cachep)) {
-		memcg_schedule_kmem_cache_create(cachep);
+		queue_work(system_wq, &cachep->memcg_params.work);
 		return cachep;
 	}
 
@@ -7122,17 +7087,6 @@ static int __init mem_cgroup_init(void)
 {
 	int cpu, node;
 
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * Kmem cache creation is mostly done with the slab_mutex held,
-	 * so use a workqueue with limited concurrency to avoid stalling
-	 * all worker threads in case lots of cgroups are created and
-	 * destroyed simultaneously.
-	 */
-	memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
-	BUG_ON(!memcg_kmem_cache_wq);
-#endif
-
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
diff --git a/mm/slab.h b/mm/slab.h
index 28c582ec997a..a4e115cb8bdc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -45,12 +45,14 @@ struct kmem_cache {
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
  * @root_caches_node: list node for slab_root_caches list.
+ * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
 	struct list_head __root_caches_node;
+	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e9deaafddbb6..10aa2acb84ca 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,10 +132,18 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 
 LIST_HEAD(slab_root_caches);
 
+static void memcg_kmem_cache_create_func(struct work_struct *work)
+{
+	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
+						 memcg_params.work);
+	memcg_create_kmem_cache(cachep);
+}
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
 	s->memcg_params.memcg_cache = NULL;
+	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
 }
 
 static void init_memcg_params(struct kmem_cache *s,
@@ -584,15 +592,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static void flush_memcg_workqueue(struct kmem_cache *s)
+static void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
-	/*
-	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
-	 * deactivates the memcg kmem_caches through workqueue. Make sure all
-	 * previous workitems on workqueue are processed.
-	 */
-	if (likely(memcg_kmem_cache_wq))
-		flush_workqueue(memcg_kmem_cache_wq);
+	cancel_work_sync(&s->memcg_params.work);
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -600,7 +602,7 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static inline void flush_memcg_workqueue(struct kmem_cache *s)
+static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
@@ -619,7 +621,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;
 
-	flush_memcg_workqueue(s);
+	cancel_memcg_cache_creation(s);
 
 	get_online_cpus();
 	get_online_mems();
-- 
2.25.3




* [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache()
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (12 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 10:34   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

The memcg_kmem_get_cache() function has become trivial, so let's
inline it into its single call site, memcg_slab_pre_alloc_hook().

This makes the code less bulky and may also help the compiler
generate better code.
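
The lockless lookup that gets inlined can be sketched in userspace with
C11 atomics. This is an illustration under assumptions, not the kernel
code: acquire/release operations stand in for READ_ONCE() and the write
barrier, the creation_requested flag stands in for queue_work(), and all
names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct cache {
	int object_size;
	_Atomic(struct cache *) memcg_cache;	/* NULL until the clone exists */
	atomic_bool creation_requested;		/* stand-in for queue_work() */
};

static void cache_init(struct cache *c, int object_size)
{
	c->object_size = object_size;
	atomic_init(&c->memcg_cache, NULL);
	atomic_init(&c->creation_requested, false);
}

/* Fast path: use the clone if it is published, else fall back to root. */
static struct cache *pre_alloc_pick(struct cache *root)
{
	struct cache *clone =
		atomic_load_explicit(&root->memcg_cache, memory_order_acquire);

	if (!clone) {
		/* Ask for asynchronous creation; this allocation uses root. */
		atomic_store(&root->creation_requested, true);
		return root;
	}
	return clone;
}

/* Slow path: fully initialize the clone first, then publish the pointer. */
static void publish_clone(struct cache *root, struct cache *clone)
{
	assert(root != NULL && clone != NULL);
	cache_init(clone, root->object_size);
	atomic_store_explicit(&root->memcg_cache, clone, memory_order_release);
}
```

The release store pairs with the acquire load, so a reader that sees a
non-NULL pointer also sees the initialized fields of the clone.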

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  2 --
 mm/memcontrol.c            | 25 +------------------------
 mm/slab.h                  | 11 +++++++++--
 mm/slab_common.c           |  2 +-
 4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87e6da5015b3..5de89a767496 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1425,8 +1425,6 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
-
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
 			unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55fd42155a37..bd58b91631f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -373,7 +373,7 @@ void memcg_put_cache_ids(void)
 
 /*
  * A lot of the calls to the cache allocation functions are expected to be
- * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * inlined by the compiler. Since the calls to memcg_slab_pre_alloc_hook() are
  * conditional to this static branch, we'll have to allow modules that does
  * kmem_cache_alloc and the such to see this symbol as well
  */
@@ -2898,29 +2898,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-/**
- * memcg_kmem_get_cache: select memcg or root cache for allocation
- * @cachep: the original global kmem cache
- *
- * Return the kmem_cache we're supposed to use for a slab allocation.
- *
- * If the cache does not exist yet, if we are the first user of it, we
- * create it asynchronously in a workqueue and let the current allocation
- * go through with the original cache.
- */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
-{
-	struct kmem_cache *memcg_cachep;
-
-	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
-	if (unlikely(!memcg_cachep)) {
-		queue_work(system_wq, &cachep->memcg_params.work);
-		return cachep;
-	}
-
-	return memcg_cachep;
-}
-
 /**
  * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
diff --git a/mm/slab.h b/mm/slab.h
index a4e115cb8bdc..cbee6cb0a331 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -332,9 +332,16 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	if (memcg_kmem_bypass())
 		return s;
 
-	cachep = memcg_kmem_get_cache(s);
-	if (is_root_cache(cachep))
+	cachep = READ_ONCE(s->memcg_params.memcg_cache);
+	if (unlikely(!cachep)) {
+		/*
+		 * If memcg cache does not exist yet, we schedule its
+		 * asynchronous creation and let the current allocation
+		 * go through with the root cache.
+		 */
+		queue_work(system_wq, &s->memcg_params.work);
 		return s;
+	}
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 10aa2acb84ca..f8874a159637 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -568,7 +568,7 @@ void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 	}
 
 	/*
-	 * Since readers won't lock (see memcg_kmem_get_cache()), we need a
+	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
 	 * initialized.
 	 */
-- 
2.25.3




* [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (13 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 10:52   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Currently there are two lists of kmem_caches:
1) slab_caches, which contains all kmem_caches,
2) slab_root_caches, which contains only root kmem_caches.

And there is some preprocessor magic to have a single list
if CONFIG_MEMCG_KMEM isn't enabled.

This was required earlier because the number of non-root kmem_caches
was proportional to the number of memory cgroups and could become
very large. Now that it cannot exceed the number of root
kmem_caches, there is no reason to maintain two lists.

We never iterate over the slab_root_caches list on any hot paths,
so it's perfectly fine to iterate over slab_caches and filter out
non-root kmem_caches.

This allows removing a lot of config-dependent code and two pointers
from the kmem_cache structure.
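
A minimal userspace sketch of "iterate over the single list and filter
out non-root caches" follows. The struct layout and helper names are
hypothetical stand-ins for struct kmem_cache, is_root_cache() and the
slab_caches list; only the filtering idea comes from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cache {
	const char *name;
	struct cache *root_cache;	/* NULL for a root cache */
	struct cache *next;		/* the single list of all caches */
};

static bool is_root(const struct cache *c)
{
	return c->root_cache == NULL;
}

/* Walk the one list and count only root caches, skipping clones. */
static size_t count_root_caches(const struct cache *head)
{
	size_t n = 0;

	for (const struct cache *c = head; c; c = c->next)
		if (is_root(c))
			n++;
	return n;
}
```

With at most one clone per root cache, the walk visits at most twice as
many entries as the old root-only list did.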

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.c        |  1 -
 mm/slab.h        | 17 -----------------
 mm/slab_common.c | 37 ++++++++-----------------------------
 mm/slub.c        |  1 -
 4 files changed, 8 insertions(+), 48 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 17f781a5b62c..5e933f5e24db 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,6 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index cbee6cb0a331..2958ca8d3159 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -44,14 +44,12 @@ struct kmem_cache {
  *
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
- * @root_caches_node: list node for slab_root_caches list.
  * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
-	struct list_head __root_caches_node;
 	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
@@ -235,11 +233,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-/* List of all root caches. */
-extern struct list_head		slab_root_caches;
-#define root_caches_node	memcg_params.__root_caches_node
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -414,14 +407,8 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
-
-/* If !memcg, all caches are root. */
-#define slab_root_caches	slab_caches
-#define root_caches_node	list
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -490,10 +477,6 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f8874a159637..c045afb9724e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -129,9 +129,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-LIST_HEAD(slab_root_caches);
-
 static void memcg_kmem_cache_create_func(struct work_struct *work)
 {
 	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
@@ -154,27 +151,11 @@ static void init_memcg_params(struct kmem_cache *s,
 	else
 		slab_init_memcg_params(s);
 }
-
-void memcg_link_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_add(&s->root_caches_node, &slab_root_caches);
-}
-
-static void memcg_unlink_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_del(&s->root_caches_node);
-}
 #else
 static inline void init_memcg_params(struct kmem_cache *s,
 				     struct kmem_cache *root_cache)
 {
 }
-
-static inline void memcg_unlink_cache(struct kmem_cache *s)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
@@ -251,7 +232,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
 	if (flags & SLAB_NEVER_MERGE)
 		return NULL;
 
-	list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry_reverse(s, &slab_caches, list) {
 		if (slab_unmergeable(s))
 			continue;
 
@@ -310,7 +291,6 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -505,7 +485,6 @@ static int shutdown_cache(struct kmem_cache *s)
 	if (__kmem_cache_shutdown(s) != 0)
 		return -EBUSY;
 
-	memcg_unlink_cache(s);
 	list_del(&s->list);
 
 	if (s->flags & SLAB_TYPESAFE_BY_RCU) {
@@ -749,7 +728,6 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1090,12 +1068,12 @@ static void print_slabinfo_header(struct seq_file *m)
 void *slab_start(struct seq_file *m, loff_t *pos)
 {
 	mutex_lock(&slab_mutex);
-	return seq_list_start(&slab_root_caches, *pos);
+	return seq_list_start(&slab_caches, *pos);
 }
 
 void *slab_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &slab_root_caches, pos);
+	return seq_list_next(p, &slab_caches, pos);
 }
 
 void slab_stop(struct seq_file *m, void *p)
@@ -1148,11 +1126,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 
 static int slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
 
-	if (p == slab_root_caches.next)
+	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	cache_show(s, m);
+	if (is_root_cache(s))
+		cache_show(s, m);
 	return 0;
 }
 
@@ -1254,7 +1233,7 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	mutex_lock(&slab_mutex);
 	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry(s, &slab_caches, list) {
 		/*
 		 * Skip kmem caches that don't have the memcg cache.
 		 */
diff --git a/mm/slub.c b/mm/slub.c
index 3e4cb081af5d..799082723e77 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4273,7 +4273,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	return s;
 }
 
-- 
2.25.3




* [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (14 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 11:31   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

memcg_accumulate_slabinfo() is never called with a non-root
kmem_cache as its first argument, so the is_root_cache(s) check
is redundant and can be removed without any functional change.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab_common.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index c045afb9724e..52164ad0f197 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1087,9 +1087,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	struct kmem_cache *c;
 	struct slabinfo sinfo;
 
-	if (!is_root_cache(s))
-		return;
-
 	c = memcg_cache(s);
 	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
-- 
2.25.3




* [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (15 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 14:55   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
  2020-04-22 20:47 ` [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool Roman Gushchin
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Instead of having two sets of kmem_caches, one for system-wide and
non-accounted allocations and a second one shared by all accounted
allocations, we can use just one.

The idea is simple: space for obj_cgroup metadata can be allocated
on demand and filled only for accounted allocations.
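
As a rough userspace model of that idea (not the kernel implementation):
each slab page carries a pointer to an ownership vector that stays NULL
until an owner is first recorded, and unaccounted objects keep a NULL
slot. struct obj_owner, struct slab_page and the helpers are
hypothetical names, and allocating the vector on the first accounted
allocation is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

struct obj_owner { int id; };		/* stand-in for struct obj_cgroup */

struct slab_page {
	unsigned int nr_objects;
	struct obj_owner **owners;	/* NULL until first accounted alloc */
};

/* Record an owner for object idx, allocating the vector on first use. */
static int set_owner(struct slab_page *page, unsigned int idx,
		     struct obj_owner *owner)
{
	assert(idx < page->nr_objects);
	if (!page->owners) {
		page->owners = calloc(page->nr_objects, sizeof(*page->owners));
		if (!page->owners)
			return -1;
	}
	page->owners[idx] = owner;
	return 0;
}

/* Returns NULL for unaccounted objects and for pages without a vector. */
static struct obj_owner *get_owner(const struct slab_page *page,
				   unsigned int idx)
{
	if (!page->owners || idx >= page->nr_objects)
		return NULL;
	return page->owners[idx];
}

static void release_owners(struct slab_page *page)
{
	free(page->owners);
	page->owners = NULL;
}
```

Lookups therefore have to tolerate a NULL owner, which is the same kind
of check this patch adds to mem_cgroup_from_obj().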

This allows removing a bunch of code that was required to handle
kmem_cache clones for accounted allocations. There is no more need
to create them, accumulate statistics, propagate attributes, etc.
It's quite a significant simplification.

Also, because the total number of slab_caches is almost halved
(not all kmem_caches have a memcg clone), some additional memory
savings are expected. On my devvm it additionally saves about 3.5%
of slab memory.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h     |   2 -
 include/linux/slab_def.h |   3 -
 include/linux/slub_def.h |   9 --
 mm/memcontrol.c          |   5 +-
 mm/slab.c                |   7 +-
 mm/slab.h                | 180 +++++++-----------------------
 mm/slab_common.c         | 230 +--------------------------------------
 mm/slub.c                | 126 +--------------------
 8 files changed, 55 insertions(+), 507 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 310768bfa8d2..694a4f69e146 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,6 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct kmem_cache *cachep);
-
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 967a9a525eab..73f9308e98e3 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -72,9 +72,6 @@ struct kmem_cache {
 	int obj_offset;
 #endif /* CONFIG_DEBUG_SLAB */
 
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-#endif
 #ifdef CONFIG_KASAN
 	struct kasan_cache kasan_info;
 #endif
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index cbda7d55796a..cdf4f299c982 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -110,15 +110,6 @@ struct kmem_cache {
 	struct kobject kobj;	/* For sysfs */
 	struct work_struct kobj_remove_work;
 #endif
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-	/* For propagation, maximum size of a stored attr */
-	unsigned int max_attr_size;
-#ifdef CONFIG_SYSFS
-	struct kset *memcg_kset;
-#endif
-#endif
-
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	unsigned long random;
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd58b91631f7..4af95739ccb6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2824,7 +2824,10 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 		off = obj_to_index(page->slab_cache, page, p);
 		objcg = page_obj_cgroups(page)[off];
-		return obj_cgroup_memcg(objcg);
+		if (objcg)
+			return obj_cgroup_memcg(objcg);
+
+		return NULL;
 	}
 
 	/* All other pages use page->mem_cgroup */
diff --git a/mm/slab.c b/mm/slab.c
index 5e933f5e24db..181ce8665d55 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1369,12 +1369,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
-			     cachep->num)) {
-		__free_pages(page, cachep->gfporder);
-		return NULL;
-	}
-
+	charge_slab_page(page, flags, cachep->gfporder, cachep, cachep->num);
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
diff --git a/mm/slab.h b/mm/slab.h
index 2958ca8d3159..13fadf33be5c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -30,28 +30,6 @@ struct kmem_cache {
 	struct list_head list;	/* List of all slab caches on the system */
 };
 
-#else /* !CONFIG_SLOB */
-
-/*
- * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child cache will have it. Some fields are used
- * in both cases, other are specific to root caches.
- *
- * @root_cache:	Common to root and child caches.  NULL for root, pointer to
- *		the root cache for children.
- *
- * The following fields are specific to root caches.
- *
- * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
- *		cgroups.
- * @work: work struct used to create the non-root cache.
- */
-struct memcg_cache_params {
-	struct kmem_cache *root_cache;
-
-	struct kmem_cache *memcg_cache;
-	struct work_struct work;
-};
 #endif /* CONFIG_SLOB */
 
 #ifdef CONFIG_SLAB
@@ -194,7 +172,6 @@ int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
-void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
@@ -233,43 +210,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return !s->memcg_params.root_cache;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return p == s || p == s->memcg_params.root_cache;
-}
-
-/*
- * We use suffixes to the name in memcg because we can't have caches
- * created in the system with the same name. But when we print them
- * locally, better refer to them with the base name
- */
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	if (!is_root_cache(s))
-		s = s->memcg_params.root_cache;
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s;
-	return s->memcg_params.root_cache;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s->memcg_params.memcg_cache;
-	return NULL;
-}
-
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -315,38 +255,25 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
-	struct kmem_cache *cachep;
 	struct obj_cgroup *objcg;
 
 	if (memcg_kmem_bypass())
-		return s;
-
-	cachep = READ_ONCE(s->memcg_params.memcg_cache);
-	if (unlikely(!cachep)) {
-		/*
-		 * If memcg cache does not exist yet, we schedule it's
-		 * asynchronous creation and let the current allocation
-		 * go through with the root cache.
-		 */
-		queue_work(system_wq, &s->memcg_params.work);
-		return s;
-	}
+		return NULL;
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
-		return s;
+		return NULL;
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		cachep = NULL;
+		return NULL;
 	}
 
-	*objcgp = objcg;
-	return cachep;
+	return objcg;
 }
 
 static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -365,15 +292,28 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 	struct page *page;
 	unsigned long off;
 	size_t i;
 
+	if (!objcg)
+		return;
+
+	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
+
+			if (!page_has_obj_cgroups(page) &&
+			    memcg_alloc_page_obj_cgroups(page, flags,
+							 objs_per_slab(s))) {
+				obj_cgroup_uncharge(objcg, obj_full_size(s));
+				continue;
+			}
+
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
@@ -392,13 +332,19 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	struct obj_cgroup *objcg;
 	unsigned int off;
 
-	if (!memcg_kmem_enabled() || is_root_cache(s))
+	if (!memcg_kmem_enabled())
+		return;
+
+	if (!page_has_obj_cgroups(page))
 		return;
 
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
 
+	if (!objcg)
+		return;
+
 	obj_cgroup_uncharge(objcg, obj_full_size(s));
 	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
 			-obj_full_size(s));
@@ -406,35 +352,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	obj_cgroup_put(objcg);
 }
 
-extern void slab_init_memcg_params(struct kmem_cache *);
-
 #else /* CONFIG_MEMCG_KMEM */
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return true;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return s == p;
-}
-
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	return s;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	return NULL;
-}
-
 static inline bool page_has_obj_cgroups(struct page *page)
 {
 	return false;
@@ -455,16 +373,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
 	return NULL;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 }
 
@@ -472,11 +391,6 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 					void *p)
 {
 }
-
-static inline void slab_init_memcg_params(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
@@ -490,28 +404,18 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 	return page->slab_cache;
 }
 
-static __always_inline int charge_slab_page(struct page *page,
-					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
+static __always_inline void charge_slab_page(struct page *page,
+					     gfp_t gfp, int order,
+					     struct kmem_cache *s)
 {
-	if (!is_root_cache(s)) {
-		int ret;
-
-		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
-		if (ret)
-			return ret;
-	}
-
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
-	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (!is_root_cache(s))
-		memcg_free_page_obj_cgroups(page);
+	memcg_free_page_obj_cgroups(page);
 
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
@@ -525,8 +429,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 	 * When kmemcg is not being used, both assignments should return the
 	 * same value. but we don't want to pay the assignment price in that
 	 * case. If it is not compiled in, the compiler should be smart enough
-	 * to not do even the assignment. In that case, slab_equal_or_root
-	 * will also be a constant.
+	 * to not do even the assignment.
 	 */
 	if (!memcg_kmem_enabled() &&
 	    !IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
@@ -534,7 +437,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 		return s;
 
 	cachep = virt_to_cache(x);
-	WARN_ONCE(cachep && !slab_equal_or_root(cachep, s),
+	WARN_ONCE(cachep && cachep != s,
 		  "%s: Wrong slab cache. %s but object is from %s\n",
 		  __func__, s->name, cachep->name);
 	return cachep;
@@ -586,7 +489,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
+		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
 
 	return s;
 }
@@ -605,8 +508,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					 s->flags, flags);
 	}
 
-	if (!is_root_cache(s))
-		memcg_slab_post_alloc_hook(s, objcg, size, p);
+	memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 52164ad0f197..7be382d45514 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -128,36 +128,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 	return i;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-static void memcg_kmem_cache_create_func(struct work_struct *work)
-{
-	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
-						 memcg_params.work);
-	memcg_create_kmem_cache(cachep);
-}
-
-void slab_init_memcg_params(struct kmem_cache *s)
-{
-	s->memcg_params.root_cache = NULL;
-	s->memcg_params.memcg_cache = NULL;
-	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
-}
-
-static void init_memcg_params(struct kmem_cache *s,
-			      struct kmem_cache *root_cache)
-{
-	if (root_cache)
-		s->memcg_params.root_cache = root_cache;
-	else
-		slab_init_memcg_params(s);
-}
-#else
-static inline void init_memcg_params(struct kmem_cache *s,
-				     struct kmem_cache *root_cache)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 /*
  * Figure out what the alignment of the objects will be given a set of
  * flags, a user specified alignment and the size of the objects.
@@ -195,9 +165,6 @@ int slab_unmergeable(struct kmem_cache *s)
 	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
 		return 1;
 
-	if (!is_root_cache(s))
-		return 1;
-
 	if (s->ctor)
 		return 1;
 
@@ -284,7 +251,6 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
@@ -342,7 +308,6 @@ kmem_cache_create_usercopy(const char *name,
 
 	get_online_cpus();
 	get_online_mems();
-	memcg_get_cache_ids();
 
 	mutex_lock(&slab_mutex);
 
@@ -392,7 +357,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
 	mutex_unlock(&slab_mutex);
 
-	memcg_put_cache_ids();
 	put_online_mems();
 	put_online_cpus();
 
@@ -505,87 +469,6 @@ static int shutdown_cache(struct kmem_cache *s)
 	return 0;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-/*
- * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
- * @root_cache: The parent of the new cache.
- *
- * This function attempts to create a kmem cache that will serve allocation
- * requests going all non-root memory cgroups to @root_cache. The new cache
- * inherits properties from its parent.
- */
-void memcg_create_kmem_cache(struct kmem_cache *root_cache)
-{
-	struct kmem_cache *s = NULL;
-	char *cache_name;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-
-	if (root_cache->memcg_params.memcg_cache)
-		goto out_unlock;
-
-	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
-	if (!cache_name)
-		goto out_unlock;
-
-	s = create_cache(cache_name, root_cache->object_size,
-			 root_cache->align,
-			 root_cache->flags & CACHE_CREATE_MASK,
-			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, root_cache);
-	/*
-	 * If we could not create a memcg cache, do not complain, because
-	 * that's not critical at all as we can always proceed with the root
-	 * cache.
-	 */
-	if (IS_ERR(s)) {
-		kfree(cache_name);
-		goto out_unlock;
-	}
-
-	/*
-	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
-	 * barrier here to ensure nobody will see the kmem_cache partially
-	 * initialized.
-	 */
-	smp_wmb();
-	root_cache->memcg_params.memcg_cache = s;
-
-out_unlock:
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	BUG_ON(!is_root_cache(s));
-
-	if (s->memcg_params.memcg_cache)
-		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
-
-	return 0;
-}
-
-static void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-	cancel_work_sync(&s->memcg_params.work);
-}
-#else
-static inline int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
@@ -600,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;
 
-	cancel_memcg_cache_creation(s);
-
 	get_online_cpus();
 	get_online_mems();
 
@@ -611,10 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	err = shutdown_memcg_caches(s);
-	if (!err)
-		err = shutdown_cache(s);
-
+	err = shutdown_cache(s);
 	if (err) {
 		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
 		       s->name);
@@ -651,33 +529,6 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-/**
- * kmem_cache_shrink_all - shrink root and memcg caches
- * @s: The cache pointer
- */
-void kmem_cache_shrink_all(struct kmem_cache *s)
-{
-	struct kmem_cache *c;
-
-	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
-		kmem_cache_shrink(s);
-		return;
-	}
-
-	get_online_cpus();
-	get_online_mems();
-	kasan_cache_shrink(s);
-	__kmem_cache_shrink(s);
-
-	c = memcg_cache(s);
-	if (c) {
-		kasan_cache_shrink(c);
-		__kmem_cache_shrink(c);
-	}
-	put_online_mems();
-	put_online_cpus();
-}
-
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
@@ -706,8 +557,6 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	slab_init_memcg_params(s);
-
 	err = __kmem_cache_create(s, flags);
 
 	if (err)
@@ -1081,25 +930,6 @@ void slab_stop(struct seq_file *m, void *p)
 	mutex_unlock(&slab_mutex);
 }
 
-static void
-memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
-{
-	struct kmem_cache *c;
-	struct slabinfo sinfo;
-
-	c = memcg_cache(s);
-	if (c) {
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-
-		info->active_slabs += sinfo.active_slabs;
-		info->num_slabs += sinfo.num_slabs;
-		info->shared_avail += sinfo.shared_avail;
-		info->active_objs += sinfo.active_objs;
-		info->num_objs += sinfo.num_objs;
-	}
-}
-
 static void cache_show(struct kmem_cache *s, struct seq_file *m)
 {
 	struct slabinfo sinfo;
@@ -1107,10 +937,8 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 	memset(&sinfo, 0, sizeof(sinfo));
 	get_slabinfo(s, &sinfo);
 
-	memcg_accumulate_slabinfo(s, &sinfo);
-
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
-		   cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
+		   s->name, sinfo.active_objs, sinfo.num_objs, s->size,
 		   sinfo.objects_per_slab, (1 << sinfo.cache_order));
 
 	seq_printf(m, " : tunables %4u %4u %4u",
@@ -1127,8 +955,7 @@ static int slab_show(struct seq_file *m, void *p)
 
 	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	if (is_root_cache(s))
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 
@@ -1153,13 +980,13 @@ void dump_unreclaimable_slab(void)
 	pr_info("Name                      Used          Total\n");
 
 	list_for_each_entry_safe(s, s2, &slab_caches, list) {
-		if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+		if (s->flags & SLAB_RECLAIM_ACCOUNT)
 			continue;
 
 		get_slabinfo(s, &sinfo);
 
 		if (sinfo.num_objs > 0)
-			pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
+			pr_info("%-17s %10luKB %10luKB\n", s->name,
 				(sinfo.active_objs * s->size) / 1024,
 				(sinfo.num_objs * s->size) / 1024);
 	}
@@ -1218,53 +1045,6 @@ static int __init slab_proc_init(void)
 }
 module_init(slab_proc_init);
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
-/*
- * Display information about kmem caches that have memcg cache.
- */
-static int memcg_slabinfo_show(struct seq_file *m, void *unused)
-{
-	struct kmem_cache *s, *c;
-	struct slabinfo sinfo;
-
-	mutex_lock(&slab_mutex);
-	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
-	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_caches, list) {
-		/*
-		 * Skip kmem caches that don't have the memcg cache.
-		 */
-		if (!s->memcg_params.memcg_cache)
-			continue;
-
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(s, &sinfo);
-		seq_printf(m, "%-17s root       %6lu %6lu %6lu %6lu\n",
-			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-
-		c = s->memcg_params.memcg_cache;
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
-			   cache_name(c), root_mem_cgroup->css.id,
-			   sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-	}
-	mutex_unlock(&slab_mutex);
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo);
-
-static int __init memcg_slabinfo_init(void)
-{
-	debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO,
-			    NULL, NULL, &memcg_slabinfo_fops);
-	return 0;
-}
-
-late_initcall(memcg_slabinfo_init);
-#endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */
 #endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */
 
 static __always_inline void *__do_krealloc(const void *p, size_t new_size,
diff --git a/mm/slub.c b/mm/slub.c
index 799082723e77..d875bab1626a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -214,13 +214,11 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
 #ifdef CONFIG_SYSFS
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
-static void memcg_propagate_slab_attrs(struct kmem_cache *s);
 static void sysfs_slab_remove(struct kmem_cache *s);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
-static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
 static inline void sysfs_slab_remove(struct kmem_cache *s) { }
 #endif
 
@@ -1536,10 +1534,8 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
-		__free_pages(page, order);
-		page = NULL;
-	}
+	if (page)
+		charge_slab_page(page, flags, order, s);
 
 	return page;
 }
@@ -4271,7 +4267,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 			p->slab_cache = s;
 #endif
 	}
-	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
 	return s;
 }
@@ -4327,7 +4322,7 @@ struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		   slab_flags_t flags, void (*ctor)(void *))
 {
-	struct kmem_cache *s, *c;
+	struct kmem_cache *s;
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
@@ -4340,12 +4335,6 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		c = memcg_cache(s);
-		if (c) {
-			c->object_size = s->object_size;
-			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
-		}
-
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
 			s = NULL;
@@ -4367,7 +4356,6 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	if (slab_state <= UP)
 		return 0;
 
-	memcg_propagate_slab_attrs(s);
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
@@ -5347,7 +5335,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink_all(s);
+		kmem_cache_shrink(s);
 	else
 		return -EINVAL;
 	return length;
@@ -5571,98 +5559,9 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		return -EIO;
 
 	err = attribute->store(s, buf, len);
-#ifdef CONFIG_MEMCG
-	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		struct kmem_cache *c;
-
-		mutex_lock(&slab_mutex);
-		if (s->max_attr_size < len)
-			s->max_attr_size = len;
-
-		/*
-		 * This is a best effort propagation, so this function's return
-		 * value will be determined by the parent cache only. This is
-		 * basically because not all attributes will have a well
-		 * defined semantics for rollbacks - most of the actions will
-		 * have permanent effects.
-		 *
-		 * Returning the error value of any of the children that fail
-		 * is not 100 % defined, in the sense that users seeing the
-		 * error code won't be able to know anything about the state of
-		 * the cache.
-		 *
-		 * Only returning the error code for the parent cache at least
-		 * has well defined semantics. The cache being written to
-		 * directly either failed or succeeded, in which case we loop
-		 * through the descendants with best-effort propagation.
-		 */
-		c = memcg_cache(s);
-		if (c)
-			attribute->store(c, buf, len);
-		mutex_unlock(&slab_mutex);
-	}
-#endif
 	return err;
 }
 
-static void memcg_propagate_slab_attrs(struct kmem_cache *s)
-{
-#ifdef CONFIG_MEMCG
-	int i;
-	char *buffer = NULL;
-	struct kmem_cache *root_cache;
-
-	if (is_root_cache(s))
-		return;
-
-	root_cache = s->memcg_params.root_cache;
-
-	/*
-	 * This mean this cache had no attribute written. Therefore, no point
-	 * in copying default values around
-	 */
-	if (!root_cache->max_attr_size)
-		return;
-
-	for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
-		char mbuf[64];
-		char *buf;
-		struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
-		ssize_t len;
-
-		if (!attr || !attr->store || !attr->show)
-			continue;
-
-		/*
-		 * It is really bad that we have to allocate here, so we will
-		 * do it only as a fallback. If we actually allocate, though,
-		 * we can just use the allocated buffer until the end.
-		 *
-		 * Most of the slub attributes will tend to be very small in
-		 * size, but sysfs allows buffers up to a page, so they can
-		 * theoretically happen.
-		 */
-		if (buffer)
-			buf = buffer;
-		else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
-			buf = mbuf;
-		else {
-			buffer = (char *) get_zeroed_page(GFP_KERNEL);
-			if (WARN_ON(!buffer))
-				continue;
-			buf = buffer;
-		}
-
-		len = attr->show(root_cache, buf);
-		if (len > 0)
-			attr->store(s, buf, len);
-	}
-
-	if (buffer)
-		free_page((unsigned long)buffer);
-#endif	/* CONFIG_MEMCG */
-}
-
 static void kmem_cache_release(struct kobject *k)
 {
 	slab_kmem_cache_release(to_slab(k));
@@ -5695,10 +5594,6 @@ static struct kset *slab_kset;
 
 static inline struct kset *cache_kset(struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG
-	if (!is_root_cache(s))
-		return s->memcg_params.root_cache->memcg_kset;
-#endif
 	return slab_kset;
 }
 
@@ -5755,9 +5650,6 @@ static void sysfs_slab_remove_workfn(struct work_struct *work)
 		 */
 		goto out;
 
-#ifdef CONFIG_MEMCG
-	kset_unregister(s->memcg_kset);
-#endif
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 out:
 	kobject_put(&s->kobj);
@@ -5806,16 +5698,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	if (err)
 		goto out_del_kobj;
 
-#ifdef CONFIG_MEMCG
-	if (is_root_cache(s) && memcg_sysfs_enabled) {
-		s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);
-		if (!s->memcg_kset) {
-			err = -ENOMEM;
-			goto out_del_kobj;
-		}
-	}
-#endif
-
 	kobject_uevent(&s->kobj, KOBJ_ADD);
 	if (!unmergeable) {
 		/* Setup first alias */
-- 
2.25.3



^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (16 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-26 15:24   ` Vlastimil Babka
  2020-04-22 20:47 ` [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool Roman Gushchin
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin

Add some tests to cover the kernel memory accounting functionality.
They cover some issues (and changes) we have had recently.

1) A test which allocates a lot of negative dentries, checks memcg
slab statistics, creates memory pressure by setting memory.high
to some low value and checks that some number of slabs was reclaimed.

2) A test which covers side effects of memcg destruction: it creates
and destroys a large number of sub-cgroups, each containing a
multi-threaded workload which allocates and releases some kernel
memory. Then it checks that the charge and memory.stat do add up
on the parent level.

3) A test which reads /proc/kpagecgroup and implicitly checks that it
doesn't crash the system.

4) A test which spawns a large number of threads and checks that
the kernel stacks accounting works as expected.

5) A test which checks that live charged slab objects do not
prevent the memory cgroup from being released after being deleted
by a user.
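
For illustration only, the sanity check behind test 2 can be sketched in
userspace as below. This is a minimal sketch and not the code from
test_kmem.c: stat_key_long() and counters_add_up() are hypothetical helpers
that work on an in-memory copy of memory.stat, whereas the selftest reads
the cgroup files through cg_read_key_long() and cg_read_long(). The slack
value is an assumption left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of cgroup_util): look for "key" at the
 * start of a line in a memory.stat-style buffer and return its value,
 * or -1 if the key is absent. The key carries its trailing space
 * ("slab "), so that it does not match longer names such as
 * "slab_reclaimable".
 */
static long stat_key_long(const char *buf, const char *key)
{
	size_t klen = strlen(key);
	const char *line = buf;

	while (line && *line) {
		if (!strncmp(line, key, klen))
			return atol(line + klen);
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Return 1 if slab + anon + file + kernel_stack is within "slack" bytes
 * of "current", 0 if it is not, and -1 if any counter is missing.
 */
static int counters_add_up(const char *stat, long current, long slack)
{
	long slab = stat_key_long(stat, "slab ");
	long anon = stat_key_long(stat, "anon ");
	long file = stat_key_long(stat, "file ");
	long kernel_stack = stat_key_long(stat, "kernel_stack ");
	long sum;

	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
	    kernel_stack < 0)
		return -1;

	sum = slab + anon + file + kernel_stack;
	/* labs(), not abs(): the operands are long. */
	return labs(sum - current) < slack;
}
```

A caller would pass the text of memory.stat, the value of memory.current
and a tolerance; the comparison is deliberately approximate because
per-CPU charge caching can leave the counters slightly out of sync.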

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 382 +++++++++++++++++++++
 3 files changed, 385 insertions(+)
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index aa6de65b0838..84cfcabea838 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -2,3 +2,4 @@
 test_memcontrol
 test_core
 test_freezer
+test_kmem
\ No newline at end of file
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 967f268fde74..4794844a228e 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -6,11 +6,13 @@ all:
 TEST_FILES     := with_stress.sh
 TEST_PROGS     := test_stress.sh
 TEST_GEN_PROGS = test_memcontrol
+TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
+$(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
new file mode 100644
index 000000000000..5bc1132fec6b
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <linux/limits.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <sys/sysinfo.h>
+#include <pthread.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+
+static int alloc_dcache(const char *cgroup, void *arg)
+{
+	unsigned long i;
+	struct stat st;
+	char buf[128];
+
+	for (i = 0; i < (unsigned long)arg; i++) {
+		snprintf(buf, sizeof(buf),
+			"/something-non-existent-with-a-long-name-%64lu-%d",
+			 i, getpid());
+		stat(buf, &st);
+	}
+
+	return 0;
+}
+
+/*
+ * This test allocates 100000 negative dentries with long names.
+ * Then it checks that "slab" in memory.stat is larger than 1M.
+ * Then it sets memory.high to 1M and checks that at least 1/2
+ * of slab memory has been reclaimed.
+ */
+static int test_kmem_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+	long slab0, slab1, current;
+
+	cg = cg_name(root, "kmem_basic_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, alloc_dcache, (void *)100000))
+		goto cleanup;
+
+	slab0 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab0 < (1 << 20))
+		goto cleanup;
+
+	cg_write(cg, "memory.high", "1M");
+	slab1 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab1 <= 0)
+		goto cleanup;
+
+	current = cg_read_long(cg, "memory.current");
+	if (current <= 0)
+		goto cleanup;
+
+	if (slab1 < slab0 / 2 && current < slab0 / 2)
+		ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+static void *alloc_kmem_fn(void *arg)
+{
+	alloc_dcache(NULL, (void *)100);
+	return NULL;
+}
+
+static int alloc_kmem_smp(const char *cgroup, void *arg)
+{
+	int nr_threads = 2 * get_nprocs();
+	pthread_t *tinfo;
+	unsigned long i;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &alloc_kmem_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return -1;
+		}
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		ret = pthread_join(tinfo[i], NULL);
+		if (ret)
+			break;
+	}
+
+	free(tinfo);
+	return ret;
+}
+
+static int cg_run_in_subcgroups(const char *parent,
+				int (*fn)(const char *cgroup, void *arg),
+				void *arg, int times)
+{
+	char *child;
+	int i;
+
+	for (i = 0; i < times; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		if (cg_run(child, fn, arg)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		cg_destroy(child);
+		free(child);
+	}
+
+	return 0;
+}
+
+/*
+ * The test creates and destroys a large number of cgroups. In each cgroup it
+ * allocates some slab memory (mostly negative dentries) using 2 * NR_CPUS
+ * threads. Then it checks the sanity of numbers on the parent level:
+ * the total size of the cgroups should be roughly equal to
+ * anon + file + slab + kernel_stack.
+ */
+static int test_kmem_memcg_deletion(const char *root)
+{
+	long current, slab, anon, file, kernel_stack, sum;
+	int ret = KSFT_FAIL;
+	char *parent;
+
+	parent = cg_name(root, "kmem_memcg_deletion_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_kmem_smp, NULL, 100))
+		goto cleanup;
+
+	current = cg_read_long(parent, "memory.current");
+	slab = cg_read_key_long(parent, "memory.stat", "slab ");
+	anon = cg_read_key_long(parent, "memory.stat", "anon ");
+	file = cg_read_key_long(parent, "memory.stat", "file ");
+	kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack ");
+	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
+	    kernel_stack < 0)
+		goto cleanup;
+
+	sum = slab + anon + file + kernel_stack;
+	if (labs(sum - current) < 4096 * 32 * 2 * get_nprocs()) {
+		ret = KSFT_PASS;
+	} else {
+		printf("memory.current = %ld\n", current);
+		printf("slab + anon + file + kernel_stack = %ld\n", sum);
+		printf("slab = %ld\n", slab);
+		printf("anon = %ld\n", anon);
+		printf("file = %ld\n", file);
+		printf("kernel_stack = %ld\n", kernel_stack);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+/*
+ * The test reads the entire /proc/kpagecgroup. If the read completes
+ * successfully (and the kernel doesn't panic), the test passes.
+ */
+static int test_kmem_proc_kpagecgroup(const char *root)
+{
+	unsigned long buf[128];
+	int ret = KSFT_FAIL;
+	ssize_t len;
+	int fd;
+
+	fd = open("/proc/kpagecgroup", O_RDONLY);
+	if (fd < 0)
+		return ret;
+
+	do {
+		len = read(fd, buf, sizeof(buf));
+	} while (len > 0);
+
+	if (len == 0)
+		ret = KSFT_PASS;
+
+	close(fd);
+	return ret;
+}
+
+static void *pthread_wait_fn(void *arg)
+{
+	sleep(100);
+	return NULL;
+}
+
+static int spawn_1000_threads(const char *cgroup, void *arg)
+{
+	int nr_threads = 1000;
+	pthread_t *tinfo;
+	unsigned long i;
+	long stack;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &pthread_wait_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return -1;
+		}
+	}
+
+	stack = cg_read_key_long(cgroup, "memory.stat", "kernel_stack ");
+	if (stack >= 4096 * 1000)
+		ret = 0;
+
+	free(tinfo);
+	return ret;
+}
+
+/*
+ * The test spawns a process, which spawns 1000 threads. Then it checks
+ * that memory.stat's kernel_stack is at least 1000 pages large.
+ */
+static int test_kmem_kernel_stacks(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+
+	cg = cg_name(root, "kmem_kernel_stacks_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, spawn_1000_threads, NULL))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+/*
+ * This test sequentially creates 30 child cgroups, allocates some
+ * kernel memory in each of them, and deletes them. Then it checks
+ * that the number of dying cgroups on the parent level is 0.
+ */
+static int test_kmem_dead_cgroups(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent;
+	long dead;
+	int i;
+
+	parent = cg_name(root, "kmem_dead_cgroups_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30))
+		goto cleanup;
+
+	for (i = 0; i < 5; i++) {
+		dead = cg_read_key_long(parent, "cgroup.stat",
+					"nr_dying_descendants ");
+		if (dead == 0) {
+			ret = KSFT_PASS;
+			break;
+		}
+		/*
+		 * Reclaiming cgroups might take some time,
+		 * let's wait a bit and repeat.
+		 */
+		sleep(1);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct kmem_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_kmem_basic),
+	T(test_kmem_memcg_deletion),
+	T(test_kmem_proc_kpagecgroup),
+	T(test_kmem_kernel_stacks),
+	T(test_kmem_dead_cgroups),
+};
+#undef T
+
+int main(int argc, char **argv)
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	/*
+	 * Check that memory controller is available:
+	 * memory is listed in cgroup.controllers
+	 */
+	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
+		ksft_exit_skip("memory controller isn't available\n");
+
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set memory controller\n");
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.25.3
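
For readers following the check in test_kmem_memcg_deletion(): a minimal
userspace sketch of the "roughly equal" comparison, assuming the margin of
32 pages of 4096 bytes per CPU, doubled, that the test uses. The helper
names stat_margin() and stats_consistent() are hypothetical and do not
appear in the patch. The sketch uses labs() because all operands are long.

```c
#include <assert.h>
#include <stdlib.h>

/* Allowed drift in bytes: 32 pages of 4096 bytes per CPU, doubled. */
static long stat_margin(long nr_cpus)
{
	return 4096L * 32 * 2 * nr_cpus;
}

/*
 * Returns 1 if memory.current is within the margin of
 * slab + anon + file + kernel_stack, 0 otherwise.
 * (Hypothetical helper, mirrors the comparison in the selftest.)
 */
static int stats_consistent(long current, long slab, long anon,
			    long file, long kernel_stack, long nr_cpus)
{
	long sum = slab + anon + file + kernel_stack;

	assert(nr_cpus > 0);
	/* labs(), not abs(): the difference is a long. */
	return labs(sum - current) < stat_margin(nr_cpus);
}
```

On a 4-CPU machine the margin works out to 1 MiB, so a 100 kB discrepancy
passes while a 2 MB one fails.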




* [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool
  2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (17 preceding siblings ...)
  2020-04-22 20:47 ` [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
@ 2020-04-22 20:47 ` Roman Gushchin
  2020-05-05 15:59   ` Tejun Heo
  18 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-22 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Roman Gushchin, Waiman Long, Tobin C . Harding,
	Tejun Heo

Add a drgn-based tool to display slab information for a given memcg.
It can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2,
but in a more flexible way.

Currently supports only SLUB configuration, but SLAB can be trivially
added later.

Output example:
$ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 tools/cgroup/memcg_slabinfo.py | 226 +++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100755 tools/cgroup/memcg_slabinfo.py

diff --git a/tools/cgroup/memcg_slabinfo.py b/tools/cgroup/memcg_slabinfo.py
new file mode 100755
index 000000000000..c4225ed63565
--- /dev/null
+++ b/tools/cgroup/memcg_slabinfo.py
@@ -0,0 +1,226 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2020 Roman Gushchin <guro@fb.com>
+# Copyright (C) 2020 Facebook
+
+from os import stat
+import argparse
+import sys
+
+from drgn.helpers.linux import list_for_each_entry, list_empty
+from drgn.helpers.linux import for_each_page
+from drgn.helpers.linux.cpumask import for_each_online_cpu
+from drgn.helpers.linux.percpu import per_cpu_ptr
+from drgn import container_of, FaultError, Object
+
+
+DESC = """
+This is a drgn script to provide slab statistics for memory cgroups.
+It supports cgroup v2 and v1 and can emulate memory.kmem.slabinfo
+interface of cgroup v1.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+
+MEMCGS = {}
+
+OO_SHIFT = 16
+OO_MASK = ((1 << OO_SHIFT) - 1)
+
+
+def err(s):
+    print('slabinfo.py: error: %s' % s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+
+def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
+    if not list_empty(css.children.address_of_()):
+        for css in list_for_each_entry('struct cgroup_subsys_state',
+                                       css.children.address_of_(),
+                                       'sibling'):
+            name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
+            memcg = container_of(css, 'struct mem_cgroup', 'css')
+            MEMCGS[css.cgroup.kn.id.value_()] = memcg
+            find_memcg_ids(css, name)
+
+
+def is_root_cache(s):
+    try:
+        return False if s.memcg_params.root_cache else True
+    except AttributeError:
+        return True
+
+
+def cache_name(s):
+    if is_root_cache(s):
+        return s.name.string_().decode('utf-8')
+    else:
+        return s.memcg_params.root_cache.name.string_().decode('utf-8')
+
+
+# SLUB
+
+def oo_order(s):
+    return s.oo.x >> OO_SHIFT
+
+
+def oo_objects(s):
+    return s.oo.x & OO_MASK
+
+
+def count_partial(n, fn):
+    nr_pages = 0
+    for page in list_for_each_entry('struct page', n.partial.address_of_(),
+                                    'lru'):
+        nr_pages += fn(page)
+    return nr_pages
+
+
+def count_free(page):
+    return page.objects - page.inuse
+
+
+def slub_get_slabinfo(s, cfg):
+    nr_slabs = 0
+    nr_objs = 0
+    nr_free = 0
+
+    for node in range(cfg['nr_nodes']):
+        n = s.node[node]
+        nr_slabs += n.nr_slabs.counter.value_()
+        nr_objs += n.total_objects.counter.value_()
+        nr_free += count_partial(n, count_free)
+
+    return {'active_objs': nr_objs - nr_free,
+            'num_objs': nr_objs,
+            'active_slabs': nr_slabs,
+            'num_slabs': nr_slabs,
+            'objects_per_slab': oo_objects(s),
+            'cache_order': oo_order(s),
+            'limit': 0,
+            'batchcount': 0,
+            'shared': 0,
+            'shared_avail': 0}
+
+
+def cache_show(s, cfg, objs):
+    if cfg['allocator'] == 'SLUB':
+        sinfo = slub_get_slabinfo(s, cfg)
+    else:
+        err('SLAB isn\'t supported yet')
+
+    if cfg['shared_slab_pages']:
+        sinfo['active_objs'] = objs
+        sinfo['num_objs'] = objs
+
+    print('%-17s %6lu %6lu %6u %4u %4d'
+          ' : tunables %4u %4u %4u'
+          ' : slabdata %6lu %6lu %6lu' % (
+              cache_name(s), sinfo['active_objs'], sinfo['num_objs'],
+              s.size, sinfo['objects_per_slab'], 1 << sinfo['cache_order'],
+              sinfo['limit'], sinfo['batchcount'], sinfo['shared'],
+              sinfo['active_slabs'], sinfo['num_slabs'],
+              sinfo['shared_avail']))
+
+
+def detect_kernel_config():
+    cfg = {}
+
+    cfg['nr_nodes'] = prog['nr_online_nodes'].value_()
+
+    if prog.type('struct kmem_cache').members[1][1] == 'flags':
+        cfg['allocator'] = 'SLUB'
+    elif prog.type('struct kmem_cache').members[1][1] == 'batchcount':
+        cfg['allocator'] = 'SLAB'
+    else:
+        err('Can\'t determine the slab allocator')
+
+    cfg['shared_slab_pages'] = False
+    try:
+        if prog.type('struct obj_cgroup'):
+            cfg['shared_slab_pages'] = True
+    except:
+        pass
+
+    return cfg
+
+
+def for_each_slab_page(prog):
+    PGSlab = 1 << prog.constant('PG_slab')
+    PGHead = 1 << prog.constant('PG_head')
+
+    for page in for_each_page(prog):
+        try:
+            if page.flags.value_() & PGSlab:
+                yield page
+        except FaultError:
+            pass
+
+
+def main():
+    parser = argparse.ArgumentParser(description=DESC,
+                                     formatter_class=
+                                     argparse.RawTextHelpFormatter)
+    parser.add_argument('cgroup', metavar='CGROUP',
+                        help='Target memory cgroup')
+    args = parser.parse_args()
+
+    try:
+        cgroup_id = stat(args.cgroup).st_ino
+        find_memcg_ids()
+        memcg = MEMCGS[cgroup_id]
+    except KeyError:
+        err('Can\'t find the memory cgroup')
+
+    cfg = detect_kernel_config()
+
+    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
+          ' : tunables <limit> <batchcount> <sharedfactor>'
+          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
+
+    if cfg['shared_slab_pages']:
+        obj_cgroups = set()
+        stats = {}
+        caches = {}
+
+        # find obj_cgroup pointers belonging to the specified cgroup
+        obj_cgroups.add(memcg.objcg.value_())
+        for ptr in list_for_each_entry('struct obj_cgroup',
+                                       memcg.objcg_list.address_of_(),
+                                       'list'):
+            obj_cgroups.add(ptr.value_())
+
+        # look over all slab pages, belonging to non-root memcgs
+        # and look for objects belonging to the given memory cgroup
+        for page in for_each_slab_page(prog):
+            objcg_vec_raw = page.obj_cgroups.value_()
+            if objcg_vec_raw == 0:
+                continue
+            cache = page.slab_cache
+            if not cache:
+                continue
+            addr = cache.value_()
+            caches[addr] = cache
+            # clear the lowest bit to get the true obj_cgroups
+            objcg_vec = Object(prog, page.obj_cgroups.type_,
+                               value=objcg_vec_raw & ~1)
+
+            if addr not in stats:
+                stats[addr] = 0
+
+            for i in range(oo_objects(cache)):
+                if objcg_vec[i].value_() in obj_cgroups:
+                    stats[addr] += 1
+
+        for addr in caches:
+            if stats[addr] > 0:
+                cache_show(caches[addr], cfg, stats[addr])
+
+    else:
+        for s in list_for_each_entry('struct kmem_cache',
+                                     memcg.kmem_caches.address_of_(),
+                                     'memcg_params.kmem_caches_node'):
+            cache_show(s, cfg, None)
+
+
+main()
-- 
2.25.3
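
The script's oo_order() and oo_objects() helpers decode a single packed
word: the page order sits above OO_SHIFT and the objects-per-slab count
below it. A minimal C sketch of that packing, assuming OO_SHIFT is 16 as
in the script. The struct name order_objects and the helpers oo_make()
and oo_pages() are illustrative only.

```c
#include <assert.h>

#define OO_SHIFT 16
#define OO_MASK ((1U << OO_SHIFT) - 1)

/* Illustrative stand-in for the packed order/objects word. */
struct order_objects {
	unsigned int x;
};

/* Pack a page order and an objects-per-slab count into one word. */
static struct order_objects oo_make(unsigned int order, unsigned int objects)
{
	struct order_objects oo;

	assert(objects <= OO_MASK);
	oo.x = (order << OO_SHIFT) + objects;
	return oo;
}

/* Page order: the bits above OO_SHIFT. */
static unsigned int oo_order(struct order_objects oo)
{
	return oo.x >> OO_SHIFT;
}

/* Objects per slab: the low OO_SHIFT bits. */
static unsigned int oo_objects(struct order_objects oo)
{
	return oo.x & OO_MASK;
}

/* Pages per slab, as printed in the <pagesperslab> column. */
static unsigned int oo_pages(struct order_objects oo)
{
	return 1U << oo_order(oo);
}
```

For the task_struct row in the output example (9 objects per slab, 8 pages
per slab) this corresponds to order 3 and a packed value of 0x30009.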




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-22 20:46 ` [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-04-22 23:52   ` Christopher Lameter
  2020-04-23  0:05     ` Roman Gushchin
  2020-04-23 21:01     ` Roman Gushchin
  2020-05-20 13:51   ` Vlastimil Babka
  1 sibling, 2 replies; 84+ messages in thread
From: Christopher Lameter @ 2020-04-22 23:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, 22 Apr 2020, Roman Gushchin wrote:

>  enum stat_item {
>  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> @@ -86,6 +87,7 @@ struct kmem_cache {
>  	unsigned long min_partial;
>  	unsigned int size;	/* The size of an object including metadata */
>  	unsigned int object_size;/* The size of an object without metadata */
> +	struct reciprocal_value reciprocal_size;


This needs to be moved further back since it is not an item that needs to
be cache hot for the hotpaths. Place it with "align", inuse etc?

Hmmm. the same applies to min_partial maybe?




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-22 23:52   ` Christopher Lameter
@ 2020-04-23  0:05     ` Roman Gushchin
  2020-04-25  2:10       ` Christopher Lameter
  2020-04-23 21:01     ` Roman Gushchin
  1 sibling, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-23  0:05 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 11:52:13PM +0000, Christoph Lameter wrote:
> On Wed, 22 Apr 2020, Roman Gushchin wrote:
> 
> >  enum stat_item {
> >  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> > @@ -86,6 +87,7 @@ struct kmem_cache {
> >  	unsigned long min_partial;
> >  	unsigned int size;	/* The size of an object including metadata */
> >  	unsigned int object_size;/* The size of an object without metadata */
> > +	struct reciprocal_value reciprocal_size;
> 
> 
> This needs to be moved further back since it is not an item that needs to
> be cache hot for the hotpaths.

It could be relatively hot, because it's accessed for reading on every
accounted allocation.

> Place it with "align", inuse etc?
> 
> Hmmm. the same applies to min_partial maybe?
> 
> 

And min_partial should be much colder.

So maybe a patch on top of the series which moves both fields can work?

Thanks!



* Re: [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-04-23 20:20   ` Roman Gushchin
  2020-05-22 18:27   ` Vlastimil Babka
  2020-05-25 14:46   ` Vlastimil Babka
  2 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-23 20:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:56PM -0700, Roman Gushchin wrote:
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
> 
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/mm_types.h |  5 ++++-
>  include/linux/slab_def.h |  5 +++++
>  include/linux/slub_def.h |  2 ++
>  mm/memcontrol.c          | 17 +++++++++++---
>  mm/slab.c                |  3 ++-
>  mm/slab.h                | 48 ++++++++++++++++++++++++++++++++++++++++
>  mm/slub.c                |  5 +++++
>  7 files changed, 80 insertions(+), 5 deletions(-)
>
...
> diff --git a/mm/slub.c b/mm/slub.c
> index 8d16babe1829..68c2c45dfac1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5992,4 +5992,9 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
>  {
>  	return -EIO;
>  }
> +
> +int objs_per_slab(struct kmem_cache *cache)
> +{
> +	return oo_objects(cache->oo);
> +}
>  #endif /* CONFIG_SLUB_DEBUG */
> -- 
> 2.25.3
> 


Ooops, the build bot found that objs_per_slab() was accidentally guarded by
CONFIG_SLUB_DEBUG. An updated version below.

--

From 6b358e0157815535c3a73b4ce7b28f9c4c7804b3 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Wed, 10 Jul 2019 15:44:38 -0700
Subject: [PATCH v3.1 07/19] mm: memcg/slab: allocate obj_cgroups for non-root
 slab pages

Allocate and release memory to store obj_cgroup pointers for each
non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
to the allocated space.

To distinguish between obj_cgroups and memcg pointers in case
when it's not obvious which one is used (as in page_cgroup_ino()),
let's always set the lowest bit in the obj_cgroup case.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mm_types.h |  5 ++++-
 include/linux/slab_def.h |  5 +++++
 include/linux/slub_def.h |  2 ++
 mm/memcontrol.c          | 17 +++++++++++---
 mm/slab.c                |  3 ++-
 mm/slab.h                | 48 ++++++++++++++++++++++++++++++++++++++++
 mm/slub.c                |  5 +++++
 7 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4aba6c0c2ba8..0ad7e700f26d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct obj_cgroup **obj_cgroups;
+	};
 #endif
 
 	/*
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index abc7de77b988..967a9a525eab 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -114,4 +114,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 	return reciprocal_divide(offset, cache->reciprocal_buffer_size);
 }
 
+static inline int objs_per_slab(const struct kmem_cache *cache)
+{
+	return cache->num;
+}
+
 #endif	/* _LINUX_SLAB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 200ea292f250..cbda7d55796a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -191,4 +191,6 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 				 cache->reciprocal_size);
 }
 
+extern int objs_per_slab(struct kmem_cache *cache);
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f87a0eeafec..63826e460b3f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -549,10 +549,21 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page))
+	if (PageSlab(page) && !PageTail(page)) {
 		memcg = memcg_from_slab_page(page);
-	else
-		memcg = READ_ONCE(page->mem_cgroup);
+	} else {
+		memcg = page->mem_cgroup;
+
+		/*
+		 * The lowest bit set means that memcg isn't a valid
+		 * memcg pointer, but an obj_cgroups pointer.
+		 * In this case the page is shared and doesn't belong
+		 * to any specific memory cgroup.
+		 */
+		if ((unsigned long) memcg & 0x1UL)
+			memcg = NULL;
+	}
+
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
diff --git a/mm/slab.c b/mm/slab.c
index 9350062ffc1a..f2d67984595b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,7 +1370,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+			     cachep->num)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 8a574d9361c1..44def57f050e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	/*
+	 * page->mem_cgroup and page->obj_cgroups are sharing the same
+	 * space. To distinguish between them in case we don't know for sure
+	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
+	 * always set the lowest bit of obj_cgroups.
+	 */
+	return (struct obj_cgroup **)
+		((unsigned long)page->obj_cgroups & ~0x1UL);
+}
+
 /*
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
@@ -406,6 +418,25 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					       unsigned int objects)
+{
+	void *vec;
+
+	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
+	if (!vec)
+		return -ENOMEM;
+
+	page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+	kfree(page_obj_cgroups(page));
+	page->obj_cgroups = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +486,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					       unsigned int objects)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -481,12 +522,18 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +546,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_obj_cgroups(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 8d16babe1829..a5fb0bb5c77a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -344,6 +344,11 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
 	return x.x & OO_MASK;
 }
 
+int objs_per_slab(struct kmem_cache *cache)
+{
+	return oo_objects(cache->oo);
+}
+
 /*
  * Per slab locking using the pagelock
  */
-- 
2.25.3
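
The low-bit tagging used above for page->obj_cgroups can be tried out in
userspace. The following is only a sketch under stated assumptions:
struct fake_page, OBJCG_TAG and the page_*() helpers are hypothetical
stand-ins, and calloc() replaces kcalloc(). It relies on allocator
alignment leaving bit 0 of the vector address clear.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct obj_cgroup;	/* opaque in this sketch */

/* Hypothetical stand-in for the mem_cgroup/obj_cgroups union in struct page. */
struct fake_page {
	uintptr_t memcg_data;
};

#define OBJCG_TAG ((uintptr_t)0x1)

/* Allocate a zeroed vector of obj_cgroup pointers and store it tagged. */
static int page_alloc_objcgs(struct fake_page *page, unsigned int objects)
{
	void *vec = calloc(objects, sizeof(struct obj_cgroup *));

	if (!vec)
		return -1;
	/* The allocation is pointer-aligned, so bit 0 is free for the tag. */
	assert(((uintptr_t)vec & OBJCG_TAG) == 0);
	page->memcg_data = (uintptr_t)vec | OBJCG_TAG;
	return 0;
}

/* True if the word holds a tagged obj_cgroups vector rather than a memcg. */
static int page_has_objcgs(const struct fake_page *page)
{
	return (page->memcg_data & OBJCG_TAG) != 0;
}

/* Strip the tag to recover the real vector address. */
static struct obj_cgroup **page_objcgs(const struct fake_page *page)
{
	return (struct obj_cgroup **)(page->memcg_data & ~OBJCG_TAG);
}

/* Free the vector and reset the word. */
static void page_free_objcgs(struct fake_page *page)
{
	free(page_objcgs(page));
	page->memcg_data = 0;
}
```

A reader that does not know whether the page is a slab page, such as
page_cgroup_ino(), only needs the equivalent of page_has_objcgs() to avoid
dereferencing the vector as a memcg.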




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-22 23:52   ` Christopher Lameter
  2020-04-23  0:05     ` Roman Gushchin
@ 2020-04-23 21:01     ` Roman Gushchin
  2020-04-25  2:10       ` Christopher Lameter
  1 sibling, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-23 21:01 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 11:52:13PM +0000, Christoph Lameter wrote:
> On Wed, 22 Apr 2020, Roman Gushchin wrote:
> 
> >  enum stat_item {
> >  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> > @@ -86,6 +87,7 @@ struct kmem_cache {
> >  	unsigned long min_partial;
> >  	unsigned int size;	/* The size of an object including metadata */
> >  	unsigned int object_size;/* The size of an object without metadata */
> > +	struct reciprocal_value reciprocal_size;
> 
> 
> This needs to be moved further back since it is not an item that needs to
> be cache hot for the hotpaths. Place it with "align", inuse etc?
> 
> Hmmm. the same applies to min_partial maybe?
> 
>

Something like this?

Thanks!

--

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index cdf4f299c982..6246a3c65cd5 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -84,10 +84,8 @@ struct kmem_cache {
        struct kmem_cache_cpu __percpu *cpu_slab;
        /* Used for retrieving partial slabs, etc. */
        slab_flags_t flags;
-       unsigned long min_partial;
        unsigned int size;      /* The size of an object including metadata */
        unsigned int object_size;/* The size of an object without metadata */
-       struct reciprocal_value reciprocal_size;
        unsigned int offset;    /* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
        /* Number of per cpu partial objects to keep around */
@@ -103,6 +101,8 @@ struct kmem_cache {
        void (*ctor)(void *);
        unsigned int inuse;             /* Offset to metadata */
        unsigned int align;             /* Alignment */
+       unsigned long min_partial;
+       struct reciprocal_value reciprocal_size;
        unsigned int red_left_pad;      /* Left redzone padding size */
        const char *name;       /* Name (only for display!) */
        struct list_head list;  /* List of slab caches */



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-23  0:05     ` Roman Gushchin
@ 2020-04-25  2:10       ` Christopher Lameter
  2020-04-25  2:46         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Christopher Lameter @ 2020-04-25  2:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, 22 Apr 2020, Roman Gushchin wrote:

> On Wed, Apr 22, 2020 at 11:52:13PM +0000, Christoph Lameter wrote:
> > On Wed, 22 Apr 2020, Roman Gushchin wrote:
> >
> > >  enum stat_item {
> > >  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> > > @@ -86,6 +87,7 @@ struct kmem_cache {
> > >  	unsigned long min_partial;
> > >  	unsigned int size;	/* The size of an object including metadata */
> > >  	unsigned int object_size;/* The size of an object without metadata */
> > > +	struct reciprocal_value reciprocal_size;
> >
> >
> > This needs to be moved further back since it is not an item that needs to
> > be cache hot for the hotpaths.
>
> It could be relatively hot, because it's accessed for reading on every
> accounted allocation.

The patch seems to only use it for setup and debugging? It is used for
every "accounted" allocation???? Where? And what is an "accounted"
allocation?




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-23 21:01     ` Roman Gushchin
@ 2020-04-25  2:10       ` Christopher Lameter
  0 siblings, 0 replies; 84+ messages in thread
From: Christopher Lameter @ 2020-04-25  2:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Thu, 23 Apr 2020, Roman Gushchin wrote:

> Something like this?

Yup.



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-25  2:10       ` Christopher Lameter
@ 2020-04-25  2:46         ` Roman Gushchin
  2020-04-27 16:21           ` Christopher Lameter
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-25  2:46 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Sat, Apr 25, 2020 at 02:10:24AM +0000, Christoph Lameter wrote:
> On Wed, 22 Apr 2020, Roman Gushchin wrote:
> 
> > On Wed, Apr 22, 2020 at 11:52:13PM +0000, Christoph Lameter wrote:
> > > On Wed, 22 Apr 2020, Roman Gushchin wrote:
> > >
> > > >  enum stat_item {
> > > >  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> > > > @@ -86,6 +87,7 @@ struct kmem_cache {
> > > >  	unsigned long min_partial;
> > > >  	unsigned int size;	/* The size of an object including metadata */
> > > >  	unsigned int object_size;/* The size of an object without metadata */
> > > > +	struct reciprocal_value reciprocal_size;
> > >
> > >
> > > This needs to be moved further back since it is not an item that needs to
> > > be cache hot for the hotpaths.
> >
> > It could be relatively hot, because it's accessed for reading on every
> > accounted allocation.
> 
> The patch seems to only use it for setup and debugging? It is used for
> every "accounted" allocation???? Where? And what is an "accounted"
> allocation?
> 
>

Please, take a look at the whole series:
https://lore.kernel.org/linux-mm/20200422204708.2176080-1-guro@fb.com/T/#t

I'm sorry, I should have cc'ed you directly on the whole series. Your feedback
will be highly appreciated.

It's used to calculate the offset of the memcg pointer for every slab
object which is charged to a memory cgroup. So it must be quite hot.
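
For readers following along, the offset calculation can be sketched in plain
userspace C. This is only an illustration under assumptions: the names
(struct recip, recip_value(), recip_divide(), obj_index()) are hypothetical
stand-ins that mimic the idea of the kernel's reciprocal_value helpers, not
the code from the series. It shows how a per-cache precomputed reciprocal
turns "byte offset within the slab page" into "object index" with a multiply
and two shifts instead of a division.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a precomputed reciprocal. */
struct recip {
	uint32_t m;		/* magic multiplier */
	uint8_t sh1, sh2;	/* post-multiply shifts */
};

/* 1-based position of the highest set bit; 0 for x == 0. */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Precompute multiplier and shifts for dividing by d (done once per cache). */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = fls32(d - 1);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);
	return r;
}

/* Divide a by the original divisor using only a multiply and shifts. */
static uint32_t recip_divide(uint32_t a, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Index of the object starting at byte offset `off` inside its slab page. */
static uint32_t obj_index(uint32_t off, struct recip size_recip)
{
	return recip_divide(off, size_recip);
}
```

The resulting index would then select the slot in the per-page vector of
memcg pointers, which is why the reciprocal is read on every accounted
allocation.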

Thanks!

Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-25  2:46         ` Roman Gushchin
@ 2020-04-27 16:21           ` Christopher Lameter
  2020-04-27 16:46             ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Christopher Lameter @ 2020-04-27 16:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, 24 Apr 2020, Roman Gushchin wrote:

> > The patch seems to only use it for setup and debugging? It is used for
> > every "accounted" allocation???? Where? And what is an "accounted"
> > allocation?
> >
> >
>
> Please, take a look at the whole series:
> https://lore.kernel.org/linux-mm/20200422204708.2176080-1-guro@fb.com/T/#t
>
> I'm sorry, I had to cc you directly for the whole thing. Your feedback
> will be highly appreciated.
>
> It's used to calculate the offset of the memcg pointer for every slab
> object which is charged to a memory cgroup. So it must be quite hot.


Ahh... Thanks. I just looked at it.

You need this because you have a separate structure attached to a page
that tracks membership of the slab object to the cgroup. This is used to
calculate the offset into that array....

Why do you need this? Just slap a pointer to the cgroup as additional
metadata onto the slab object. Is that not much simpler, safer and faster?



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-27 16:21           ` Christopher Lameter
@ 2020-04-27 16:46             ` Roman Gushchin
  2020-04-28 17:06               ` Roman Gushchin
                                 ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-27 16:46 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, Apr 27, 2020 at 04:21:01PM +0000, Christoph Lameter wrote:
> On Fri, 24 Apr 2020, Roman Gushchin wrote:
> 
> > > The patch seems to only use it for setup and debugging? It is used for
> > > every "accounted" allocation???? Where? And what is an "accounted"
> > > allocation?
> > >
> > >
> >
> > Please, take a look at the whole series:
> > https://lore.kernel.org/linux-mm/20200422204708.2176080-1-guro@fb.com/T/#t
> >
> > I'm sorry, I had to cc you directly for the whole thing. Your feedback
> > will be highly appreciated.
> >
> > It's used to calculate the offset of the memcg pointer for every slab
> > object which is charged to a memory cgroup. So it must be quite hot.
> 
> 
> Ahh... Thanks. I just looked at it.
> 
> You need this because you have a separate structure attached to a page
> that tracks membership of the slab object to the cgroup. This is used to
> calculate the offset into that array....
> 
> Why do you need this? Just slap a pointer to the cgroup as additional
> metadata onto the slab object. Is that not much simpler, safer and faster?
> 

So, the problem is that not all slab objects are accounted, and sometimes
we don't know in advance whether they are accounted or not (with the current semantics
of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to increase
the size of ALL slab objects, or create a pair of slab caches for each size.

The first option is not that cheap in terms of the memory overhead. Especially
for those who disable cgroups using a boot-time option.
The second should be fine, but it will be less simple in terms of the code complexity
(in comparison to the final result of the current proposal).

I'm not strictly against either approach, but I'd look for a broader consensus
on what's the best approach here.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-27 16:46             ` Roman Gushchin
@ 2020-04-28 17:06               ` Roman Gushchin
  2020-04-28 17:45               ` Johannes Weiner
  2020-04-30 16:29               ` Christopher Lameter
  2 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-04-28 17:06 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, Apr 27, 2020 at 09:46:38AM -0700, Roman Gushchin wrote:
> On Mon, Apr 27, 2020 at 04:21:01PM +0000, Christoph Lameter wrote:
> > On Fri, 24 Apr 2020, Roman Gushchin wrote:
> > 
> > > > The patch seems to only use it for setup and debugging? It is used for
> > > > every "accounted" allocation???? Where? And what is an "accounted"
> > > > allocation?
> > > >
> > > >
> > >
> > > Please, take a look at the whole series:
> > > https://lore.kernel.org/linux-mm/20200422204708.2176080-1-guro@fb.com/T/#t
> > >
> > > I'm sorry, I had to cc you directly for the whole thing. Your feedback
> > > will be highly appreciated.
> > >
> > > It's used to calculate the offset of the memcg pointer for every slab
> > > object which is charged to a memory cgroup. So it must be quite hot.
> > 
> > 
> > Ahh... Thanks. I just looked at it.
> > 
> > You need this because you have a separate structure attached to a page
> > that tracks membership of the slab object to the cgroup. This is used to
> > calculate the offset into that array....
> > 
> > Why do you need this? Just slap a pointer to the cgroup as additional
> > metadata onto the slab object. Is that not much simpler, safer and faster?
> > 
> 
> So, the problem is that not all slab objects are accounted, and sometimes
> we don't know in advance whether they are accounted or not (with the current semantics
> of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to increase
> the size of ALL slab objects, or create a pair of slab caches for each size.
> 
> The first option is not that cheap in terms of the memory overhead. Especially
> for those who disable cgroups using a boot-time option.
> The second should be fine, but it will be less simple in terms of the code complexity
> (in comparison to the final result of the current proposal).
> 
> I'm not strictly against either approach, but I'd look for a broader consensus
> on what's the best approach here.

To be more clear here: in my original version (prior to v3) I had two sets
of kmem_caches: one for root and other non-accounted allocations, and the
other shared by all non-root memory cgroups. With this approach it's
easy to switch to your suggestion and put the memcg pointer next to the object.

Johannes persistently pushed for a design with a single set of kmem_caches,
shared by *all* allocations. I've implemented this approach as a separate patch
on top of the series and added it to v3. It allows us to dramatically simplify the code
and remove ~0.5k sloc, but with this approach it's not easy to implement what
you're suggesting without increasing the size of *all* slab objects, which is
sub-optimal.

So it looks like there are two options:
1) switch back to separate root and memcg sets of kmem_caches, put the memcg pointer
   just behind the slab object
2) stick with what we have in v3

I guess the first option might be better from the performance POV, the second
is simpler/cleaner in terms of the code. So I'm ok to switch to 1) if there is
a consensus that it's better.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-27 16:46             ` Roman Gushchin
  2020-04-28 17:06               ` Roman Gushchin
@ 2020-04-28 17:45               ` Johannes Weiner
  2020-04-30 16:29               ` Christopher Lameter
  2 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-04-28 17:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christopher Lameter, Andrew Morton, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, Apr 27, 2020 at 09:46:38AM -0700, Roman Gushchin wrote:
> On Mon, Apr 27, 2020 at 04:21:01PM +0000, Christoph Lameter wrote:
> > On Fri, 24 Apr 2020, Roman Gushchin wrote:
> > 
> > > > The patch seems to only use it for setup and debugging? It is used for
> > > > every "accounted" allocation???? Where? And what is an "accounted"
> > > > allocation?
> > > >
> > > >
> > >
> > > Please, take a look at the whole series:
> > > https://lore.kernel.org/linux-mm/20200422204708.2176080-1-guro@fb.com/T/#t
> > >
> > > I'm sorry, I had to cc you directly for the whole thing. Your feedback
> > > will be highly appreciated.
> > >
> > > It's used to calculate the offset of the memcg pointer for every slab
> > > object which is charged to a memory cgroup. So it must be quite hot.
> > 
> > 
> > Ahh... Thanks. I just looked at it.
> > 
> > You need this because you have a separate structure attached to a page
> > that tracks membership of the slab object to the cgroup. This is used to
> > calculate the offset into that array....
> > 
> > Why do you need this? Just slap a pointer to the cgroup as additional
> > metadata onto the slab object. Is that not much simpler, safer and faster?
> > 
> 
> So, the problem is that not all slab objects are accounted, and sometimes
> we don't know in advance whether they are accounted or not (with the current semantics
> of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to increase
> the size of ALL slab objects, or create a pair of slab caches for each size.

Both options seem completely disproportionate in their memory cost,
and the latter one in terms of code complexity, to avoid the offset
calculation. As a share of the total object accounting cost, I'd
expect this to be minimal.

Does the mult stand out in an annotated perf profile?

Is it enough to bring back 500+ lines of code, an additional branch on
accounted allocations, and the memory fragmentation of split caches?

I highly doubt it.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-27 16:46             ` Roman Gushchin
  2020-04-28 17:06               ` Roman Gushchin
  2020-04-28 17:45               ` Johannes Weiner
@ 2020-04-30 16:29               ` Christopher Lameter
  2020-04-30 17:15                 ` Roman Gushchin
  2 siblings, 1 reply; 84+ messages in thread
From: Christopher Lameter @ 2020-04-30 16:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, 27 Apr 2020, Roman Gushchin wrote:

> > Why do you need this? Just slap a pointer to the cgroup as additional
> > metadata onto the slab object. Is that not much simpler, safer and faster?
> >
>
> So, the problem is that not all slab objects are accounted, and sometimes
> we don't know in advance whether they are accounted or not (with the current semantics
> of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to increase
> the size of ALL slab objects, or create a pair of slab caches for each size.

>
> The first option is not that cheap in terms of the memory overhead. Especially
> for those who disable cgroups using a boot-time option.


If cgroups are disabled at boot time then you can switch back to the
compact version. Otherwise just add a pointer to each object. It will make
it consistent and there is not much memory wastage.

The problem comes about with the power of 2 caches in the kmalloc array.
If one keeps the "natural alignment" instead of going for the normal
alignment of slab caches then the alignment will cause a lot of memory
wastage and thus the scheme of off slab metadata is likely going to be
unavoidable.
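
To put rough numbers on the alignment concern, here is a small userspace
sketch. Everything in it is an assumption for illustration: a 4096-byte
single-page slab, 8-byte pointers, no other per-object metadata, and
hypothetical helper names. It compares how many objects fit in a page when
a pointer is embedded and the slot is rounded up to keep power-of-2
alignment, against keeping the object size unchanged and storing the
pointers in a separate vector.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_PTR_SIZE  8u	/* assumed pointer size (64-bit) */

/* Smallest power of two that is >= n (n >= 1). */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Objects per page if a pointer is embedded and natural alignment is kept. */
static unsigned int objs_embedded(unsigned int obj_size)
{
	return SKETCH_PAGE_SIZE / round_up_pow2(obj_size + SKETCH_PTR_SIZE);
}

/* Objects per page when the object size is left unchanged. */
static unsigned int objs_external(unsigned int obj_size)
{
	return SKETCH_PAGE_SIZE / obj_size;
}

/* Size of the off-slab pointer vector covering one page of objects. */
static unsigned int external_vector_bytes(unsigned int obj_size)
{
	return objs_external(obj_size) * SKETCH_PTR_SIZE;
}
```

Under these assumptions a 64-byte kmalloc object grows to a 128-byte slot
when the pointer is embedded, halving the objects per page from 64 to 32,
while the external vector for the same page costs 512 bytes.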

But I think we are just stacking one bad idea onto another here making
things much more complex than they could be. Well at least this justifies
all our jobs .... (not mine I am out of work... hehehe)



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-30 16:29               ` Christopher Lameter
@ 2020-04-30 17:15                 ` Roman Gushchin
  2020-05-02 23:54                   ` Christopher Lameter
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-04-30 17:15 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Thu, Apr 30, 2020 at 04:29:50PM +0000, Christoph Lameter wrote:
> On Mon, 27 Apr 2020, Roman Gushchin wrote:
> 
> > > Why do you need this? Just slap a pointer to the cgroup as additional
> > > metadata onto the slab object. Is that not much simpler, safer and faster?
> > >
> >
> > So, the problem is that not all slab objects are accounted, and sometimes
> > we don't know in advance whether they are accounted or not (with the current semantics
> > of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to increase
> > the size of ALL slab objects, or create a pair of slab caches for each size.
> 
> >
> > The first option is not that cheap in terms of the memory overhead. Especially
> > for those who disable cgroups using a boot-time option.
> 
> 
> If the cgroups are disabled on boot time then you can switch back to the
> compact version. Otherwise just add a pointer to each object. It will make
> it consistent and there is not much memory wastage.
> 
> The problem comes about with the power of 2 caches in the kmalloc array.

It's a very good point, and it's an argument to stick with the current design
(an external vector of memcg pointers).

> If one keeps the "natural alignment" instead of going for the normal
> alignment of slab caches then the alignment will cause a lot of memory
> wastage and thus the scheme of off slab metadata is likely going to be
> unavoidable.
> 
> But I think we are just stacking one bad idea onto another here making
> things much more complex than they could be. Well at least this justifies
> all our jobs .... (not mine I am out of work... hehehe)

Sorry, but what exactly do you mean?
I don't think reducing the kernel memory footprint by almost half
is such a bad idea.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-30 17:15                 ` Roman Gushchin
@ 2020-05-02 23:54                   ` Christopher Lameter
  2020-05-04 18:29                     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Christopher Lameter @ 2020-05-02 23:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Thu, 30 Apr 2020, Roman Gushchin wrote:

> Sorry, but what exactly do you mean?

I think the right approach is to add a pointer to each slab object for
memcg support.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-02 23:54                   ` Christopher Lameter
@ 2020-05-04 18:29                     ` Roman Gushchin
  2020-05-08 21:35                       ` Christopher Lameter
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-04 18:29 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Sat, May 02, 2020 at 11:54:09PM +0000, Christoph Lameter wrote:
> On Thu, 30 Apr 2020, Roman Gushchin wrote:
> 
> > Sorry, but what exactly do you mean?
> 
> I think the right approach is to add a pointer to each slab object for
> memcg support.
>

As I understand it, embedding the memcg pointer will hopefully make allocations
cheaper in terms of CPU, but will require more memory. And you think that
it's worth it. Is that a correct understanding?

Can you please describe in a bit more detail how it should be done
from your point of view?
I mean where to store the pointer, whether it should be SLAB/SLUB-specific code
or generic code, what to do with kmalloc alignments, whether we should
merge slabs which had a different size before and now have the same size
because of the memcg pointer and alignment, etc.

I'm happy to follow your advice and perform some tests to get an idea of
how significant the memory overhead is and how big the CPU savings are.
I guess with these numbers it will be easy to make a decision.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool
  2020-04-22 20:47 ` [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool Roman Gushchin
@ 2020-05-05 15:59   ` Tejun Heo
  0 siblings, 0 replies; 84+ messages in thread
From: Tejun Heo @ 2020-05-05 15:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel, Waiman Long, Tobin C . Harding

On Wed, Apr 22, 2020 at 01:47:08PM -0700, Roman Gushchin wrote:
> Add a drgn-based tool to display slab information for a given memcg.
> Can replace cgroup v1 memory.kmem.slabinfo interface on cgroup v2,
> but in a more flexible way.
> 
> Currently supports only SLUB configuration, but SLAB can be trivially
> added later.
...
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Tobin C. Harding <tobin@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
@ 2020-05-07 20:33   ` Johannes Weiner
  2020-05-20 10:49   ` Vlastimil Babka
  1 sibling, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-05-07 20:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:50PM -0700, Roman Gushchin wrote:
> To convert memcg and lruvec slab counters to bytes there must be
> a way to change these counters without touching node counters.
> Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Roman and I have talked a bunch about the function names here. They're
not optimal, with mod_lruvec_state() doing the entire intersection -
node, memcg, lruvec - and mod_memcg_lruvec_state() being a specific
version that does not update the node.

However, the usecases for mod_memcg_lruvec_state() are highly
specific, so the function won't be widely used. As such, it received
the longer name, and we get to keep the shorter mod_lruvec_state() for
the much more widely used function.
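
The relationship between the two helpers can be pictured with a toy
userspace model. This is a sketch only: struct counters and the model_*
functions are hypothetical and ignore per-cpu batching, stat item indices
and everything else the real functions deal with. The point is just that
the short-named helper updates all three levels, while the long-named one
skips the node counter.

```c
#include <assert.h>

/* Toy model: one counter per accounting level for a single stat item. */
struct counters {
	long node;
	long memcg;
	long lruvec;
};

/* Narrow helper: update memcg and lruvec, leave the node counter alone. */
static void model_mod_memcg_lruvec_state(struct counters *c, long val)
{
	assert(c);
	c->memcg += val;
	c->lruvec += val;
}

/* Wide helper: node counter plus everything the narrow helper does. */
static void model_mod_lruvec_state(struct counters *c, long val)
{
	assert(c);
	c->node += val;
	model_mod_memcg_lruvec_state(c, val);
}
```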


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-04-22 20:46 ` [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
@ 2020-05-07 20:34   ` Johannes Weiner
  2020-05-20 11:31   ` Vlastimil Babka
  1 sibling, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-05-07 20:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:51PM -0700, Roman Gushchin wrote:
> To implement per-object slab memory accounting, we need to
> convert slab vmstat counters to bytes. Actually, out of
> 4 levels of counters: global, per-node, per-memcg and per-lruvec
> only two last levels will require byte-sized counters.
> It's because global and per-node counters will be counting the
> number of slab pages, and per-memcg and per-lruvec will be
> counting the amount of memory taken by charged slab objects.
> 
> Converting all vmstat counters to bytes or even all slab
> counters to bytes would introduce an additional overhead.
> So instead let's store global and per-node counters
> in pages, and memcg and lruvec counters in bytes.
> 
> To make the API clean all access helpers (both on the read
> and write sides) are dealing with bytes.
> 
> To avoid back-and-forth conversions a new flavor of helpers
> is introduced, which always returns values in pages:
> node_page_state_pages() and global_node_page_state_pages().
> 
> Actually new helpers are just reading raw values. Old helpers are
> simple wrappers, which perform a conversion if the vmstat items are
> in bytes. Because at the moment no one actually needs bytes,
> there are WARN_ON_ONCE() macros inside to warn about inappropriate
> use cases.
> 
> Thanks to Johannes Weiner for the idea of having the byte-sized API
> on top of the page-sized internal storage.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-04-22 20:46 ` [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
@ 2020-05-07 20:41   ` Johannes Weiner
  2020-05-20 12:25   ` Vlastimil Babka
  1 sibling, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-05-07 20:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:52PM -0700, Roman Gushchin wrote:
> In order to prepare for per-object slab memory accounting, convert
> NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
> 
> To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
> NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
> 
> Internally global and per-node counters are stored in pages,
> however memcg and lruvec counters are stored in bytes.
> This scheme may look weird, but only for now. As soon as slab
> pages will be shared between multiple cgroups, global and
> node counters will reflect the total number of slab pages.
> However memcg and lruvec counters will be used for per-memcg
> slab memory tracking, which will take separate kernel objects
> into account. Keeping global and node counters in pages helps
> to avoid additional overhead.
> 
> The size of slab memory shouldn't exceed 4Gb on 32-bit machines,
> so it will fit into atomic_long_t we use for vmstats.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for splitting this out, it makes both this and the previous
patch easier to read.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
  2020-04-22 20:46 ` [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-05-07 21:03   ` Johannes Weiner
  2020-05-07 22:26     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-05-07 21:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  }
>  
>  #ifdef CONFIG_MEMCG_KMEM
> +extern spinlock_t css_set_lock;
> +
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	struct mem_cgroup *memcg;
> +	unsigned int nr_bytes;
> +	unsigned int nr_pages;
> +	unsigned long flags;
> +
> +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> +	nr_pages = nr_bytes >> PAGE_SHIFT;

What guarantees that we don't have a partial page in there at this
point? I guess any outstanding allocations would pin the objcg, so
when it's released all objects have been freed.

But if that's true, how can we have full pages remaining in there now?

> @@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
>  	return page->mem_cgroup;
>  }
>  
> +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (unlikely(!current->mm))
> +		return NULL;
> +
> +	rcu_read_lock();
> +	if (unlikely(current->active_memcg))
> +		memcg = rcu_dereference(current->active_memcg);
> +	else
> +		memcg = mem_cgroup_from_task(current);
> +
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		objcg = rcu_dereference(memcg->objcg);
> +		if (objcg && obj_cgroup_tryget(objcg))
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return objcg;
> +}

Thanks for moving this here from one of the later patches, it helps
understanding the life cycle of obj_cgroup better.

> +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned int nr_pages, nr_bytes;
> +	int ret;
> +
> +	if (consume_obj_stock(objcg, size))
> +		return 0;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	css_get(&memcg->css);
> +	rcu_read_unlock();
> +
> +	nr_pages = size >> PAGE_SHIFT;
> +	nr_bytes = size & (PAGE_SIZE - 1);
> +
> +	if (nr_bytes)
> +		nr_pages += 1;
> +
> +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);

If consume_obj_stock() fails because some other memcg is cached,
should this try to consume the partial page in objcg->nr_charged_bytes
before getting more pages?

> +	if (!ret && nr_bytes)
> +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);

This will put the cgroup into the cache if the allocation resulted in
a partially free page.

But if this was a page allocation, we may have objcg->nr_cache_bytes
from a previous subpage allocation that we should probably put back
into the stock.

It's not a common case, I'm just trying to understand what
objcg->nr_cache_bytes holds and when it does so.

The rest of this patch looks good to me!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-04-22 20:46 ` [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2020-05-07 21:05   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-05-07 21:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, Apr 22, 2020 at 01:46:59PM -0700, Roman Gushchin wrote:
> Deprecate memory.kmem.slabinfo.
> 
> An empty file will be presented if corresponding config options are
> enabled.
> 
> The interface is implementation dependent, isn't present in cgroup v2,
> and is generally useful only for core mm debugging purposes. In other
> words, it doesn't provide any value for the absolute majority of users.
> 
> A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> and also allows to get any additional information without a need
> to recompile the kernel.
> 
> If a drgn-based solution is too slow for a task, a bpf-based tracing
> tool can be used, which can easily keep track of all slab allocations
> belonging to a memory cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
  2020-05-07 21:03   ` Johannes Weiner
@ 2020-05-07 22:26     ` Roman Gushchin
  2020-05-12 22:56       ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-07 22:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> >  }
> >  
> >  #ifdef CONFIG_MEMCG_KMEM
> > +extern spinlock_t css_set_lock;
> > +
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > +	struct mem_cgroup *memcg;
> > +	unsigned int nr_bytes;
> > +	unsigned int nr_pages;
> > +	unsigned long flags;
> > +
> > +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > +	nr_pages = nr_bytes >> PAGE_SHIFT;
> 
> What guarantees that we don't have a partial page in there at this
> point? I guess any outstanding allocations would pin the objcg, so
> when it's released all objects have been freed.

Right, this is exactly the reason why there can't be a partial page
at this point.

> 
> But if that's true, how can we have full pages remaining in there now?

Imagine the following sequence:
1) CPU0: objcg == stock->cached_objcg
2) CPU1: we do a small allocation (e.g. 92 bytes), a page is charged
3) CPU1: a process from another memcg is allocating something, the stock is flushed,
   objcg->nr_charged_bytes = PAGE_SIZE - 92
4) CPU0: we release this object, 92 bytes are added to stock->nr_bytes
5) CPU0: the stock is flushed, 92 bytes are added to objcg->nr_charged_bytes

As a result, nr_charged_bytes == PAGE_SIZE. This page will be uncharged
in obj_cgroup_release().
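
The sequence can be replayed with a toy userspace model. It is a sketch
under assumptions, not the patch code: struct objcg_model, struct
stock_model and the helper names are hypothetical, a 4096-byte page is
assumed, and atomics, locking and the real memcg charge are left out. It
only tracks where the leftover bytes end up.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct objcg_model {
	unsigned int nr_charged_bytes;	/* leftover bytes parked in the objcg */
};

struct stock_model {
	struct objcg_model *cached;	/* objcg cached on this CPU, or NULL */
	unsigned int nr_bytes;		/* pre-charged bytes held locally */
};

/* Return whatever the CPU stock holds to its objcg and empty the stock. */
static void flush_stock(struct stock_model *s)
{
	if (s->cached)
		s->cached->nr_charged_bytes += s->nr_bytes;
	s->cached = NULL;
	s->nr_bytes = 0;
}

/* Make `o` the cached objcg on this CPU, flushing any other one first. */
static void switch_stock(struct stock_model *s, struct objcg_model *o)
{
	if (s->cached != o) {
		flush_stock(s);
		s->cached = o;
	}
}

/* Charge a sub-page object; on a stock miss a whole page is charged. */
static void charge_small(struct stock_model *s, struct objcg_model *o,
			 unsigned int size)
{
	assert(size < SKETCH_PAGE_SIZE);
	switch_stock(s, o);
	if (s->nr_bytes >= size)
		s->nr_bytes -= size;
	else
		s->nr_bytes += SKETCH_PAGE_SIZE - size;
}

/* Free an object: its bytes go back to the local stock. */
static void uncharge(struct stock_model *s, struct objcg_model *o,
		     unsigned int size)
{
	switch_stock(s, o);
	s->nr_bytes += size;
}

/* Replay the scenario and return what is left in the first objcg. */
static unsigned int replay_sequence(void)
{
	struct objcg_model a = { 0 }, b = { 0 };
	struct stock_model cpu0 = { &a, 0 }, cpu1 = { NULL, 0 };

	charge_small(&cpu1, &a, 92);	/* CPU1: small allocation, page charged */
	charge_small(&cpu1, &b, 92);	/* CPU1: other memcg, a's stock flushed */
	uncharge(&cpu0, &a, 92);	/* CPU0: the object is released */
	flush_stock(&cpu0);		/* CPU0: stock flushed back to a */
	return a.nr_charged_bytes;
}
```

In this model the first objcg ends up holding exactly one full page and no
partial page, which matches the state the release path expects to find.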

I've double checked this, it's actually pretty easy to trigger in real life.

> 
> > @@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  	return page->mem_cgroup;
> >  }
> >  
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > +	struct obj_cgroup *objcg = NULL;
> > +	struct mem_cgroup *memcg;
> > +
> > +	if (unlikely(!current->mm))
> > +		return NULL;
> > +
> > +	rcu_read_lock();
> > +	if (unlikely(current->active_memcg))
> > +		memcg = rcu_dereference(current->active_memcg);
> > +	else
> > +		memcg = mem_cgroup_from_task(current);
> > +
> > +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +		objcg = rcu_dereference(memcg->objcg);
> > +		if (objcg && obj_cgroup_tryget(objcg))
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return objcg;
> > +}
> 
> Thanks for moving this here from one of the later patches, it helps
> understanding the life cycle of obj_cgroup better.
> 
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned int nr_pages, nr_bytes;
> > +	int ret;
> > +
> > +	if (consume_obj_stock(objcg, size))
> > +		return 0;
> > +
> > +	rcu_read_lock();
> > +	memcg = obj_cgroup_memcg(objcg);
> > +	css_get(&memcg->css);
> > +	rcu_read_unlock();
> > +
> > +	nr_pages = size >> PAGE_SHIFT;
> > +	nr_bytes = size & (PAGE_SIZE - 1);
> > +
> > +	if (nr_bytes)
> > +		nr_pages += 1;
> > +
> > +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> 
> If consume_obj_stock() fails because some other memcg is cached,
> should this try to consume the partial page in objcg->nr_charged_bytes
> before getting more pages?

We can definitely do it, but I'm not sure it's good for performance.

Dealing with nr_charged_bytes will require up to two atomic writes,
so calling __memcg_kmem_charge() can be faster if the memcg is cached
on the percpu stock.

> 
> > +	if (!ret && nr_bytes)
> > +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> 
> This will put the cgroup into the cache if the allocation resulted in
> a partially free page.
> 
> But if this was a page allocation, we may have objcg->nr_cache_bytes
> from a previous subpage allocation that we should probably put back
> into the stock.

Yeah, we can do this, but I don't know if there will be any benefits.

Actually we don't want to touch objcg->nr_charged_bytes too often, as
it can become a contention point if there are many threads allocating
in the memory cgroup.

So maybe we want to do the opposite: relax it a bit and stop flushing
it on every stock refill and flush only if it exceeds a certain value.
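
A rough userspace sketch of that relaxed scheme, using C11 atomics. The
names (shared_counter, local_cache, cache_add) and the FLUSH_THRESHOLD
value are assumptions made up for the example; nothing like this exists
in the patch as posted.

```c
#include <stdatomic.h>

#define FLUSH_THRESHOLD 1024u	/* hypothetical batching threshold */

/* Hypothetical model of the shared counter that should stay cold. */
struct shared_counter {
	atomic_uint nr_charged_bytes;
};

/* Hypothetical per-cpu accumulator. */
struct local_cache {
	unsigned int pending;
};

/*
 * Accumulate bytes locally and touch the shared atomic only once the
 * pending amount reaches the threshold. Returns 1 if a flush happened.
 */
static int cache_add(struct local_cache *c, struct shared_counter *s,
		     unsigned int bytes)
{
	c->pending += bytes;
	if (c->pending < FLUSH_THRESHOLD)
		return 0;

	atomic_fetch_add(&s->nr_charged_bytes, c->pending);
	c->pending = 0;
	return 1;
}
```

The trade-off is that up to FLUSH_THRESHOLD bytes per CPU stay invisible
to the shared counter until the next flush.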

> 
> It's not a common case, I'm just trying to understand what
> objcg->nr_cache_bytes holds and when it does so.

So it's actually a centralized leftover from the rounding of the actual
charge to the page size.
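
To make the rounding concrete, here is a small userspace sketch of the
arithmetic from obj_cgroup_charge(). It assumes PAGE_SHIFT is 12, and
charge_split / split_charge() are hypothetical helpers for illustration
only.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define PAGE_SIZE (1UL << PAGE_SHIFT)

struct charge_split {
	unsigned long nr_pages;	/* whole pages charged to the memcg */
	unsigned long leftover;	/* bytes handed back to the stock */
};

/* Same rounding as in the quoted patch: round the charge up to pages. */
static struct charge_split split_charge(size_t size)
{
	struct charge_split s;
	unsigned long nr_bytes = size & (PAGE_SIZE - 1);

	s.nr_pages = size >> PAGE_SHIFT;
	if (nr_bytes)
		s.nr_pages += 1;
	s.leftover = nr_bytes ? PAGE_SIZE - nr_bytes : 0;

	/* Pages charged always cover the request plus the leftover. */
	assert(s.nr_pages * PAGE_SIZE == size + s.leftover);
	return s;
}
```

For a 92-byte request this gives one charged page and 4004 leftover
bytes; those leftover bytes are what moves between the percpu stock and
the objcg counter.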

> 
> The rest of this patch looks good to me!

Great!

Thank you very much for the review!



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-04 18:29                     ` Roman Gushchin
@ 2020-05-08 21:35                       ` Christopher Lameter
  2020-05-13  0:57                         ` Roman Gushchin
  2020-05-15 20:02                         ` Roman Gushchin
  0 siblings, 2 replies; 84+ messages in thread
From: Christopher Lameter @ 2020-05-08 21:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, 4 May 2020, Roman Gushchin wrote:

> On Sat, May 02, 2020 at 11:54:09PM +0000, Christoph Lameter wrote:
> > On Thu, 30 Apr 2020, Roman Gushchin wrote:
> >
> > > Sorry, but what exactly do you mean?
> >
> > I think the right approach is to add a pointer to each slab object for
> > memcg support.
> >
>
> As I understand, embedding the memcg pointer will hopefully make allocations
> cheaper in terms of CPU, but will require more memory. And you think that
> it's worth it. Is it a correct understanding?

It definitely makes the code less complex. The additional memory is
minimal. In many cases you already have some space wasted at the end of
the object that could be used for the pointer.

> Can you, please, describe a bit more detailed how it should be done
> from your point of view?

Add it to the metadata at the end of the object. Like the debugging
information or the pointer for RCU freeing.

> I mean where to store the pointer, should it be SLAB/SLUB-specific code
> or a generic code, what do to with kmallocs alignments, should we
> merge slabs which had a different size before and now have the same
> because of the memcg pointer and aligment, etc.

Both SLAB and SLUB have the same capabilities there. Slabs that had
different sizes before will now have different sizes as well. So the
merging does not change.

> I'm happy to follow your advice and perform some tests to get an idea of
> how significant the memory overhead is and how big are CPU savings.
> I guess with these numbers it will be easy to make a decision.

Sure. The main issue is the power-of-two kmalloc caches and how to add
the pointer to these caches in order not to waste memory. SLAB has done
this in the past by creating additional structures in a page frame.
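
The power-of-two concern can be illustrated with a deliberately
simplified model. It assumes every cache size is a plain power of two
and ignores real allocator alignment rules; roundup_pow2() and
size_with_embedded_ptr() are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

/* Smallest power of two that is >= n (n > 0). */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes consumed per object if a pointer slot is appended to the
 * object and the cache size must remain a power of two.
 */
static size_t size_with_embedded_ptr(size_t obj_size)
{
	return roundup_pow2(obj_size + sizeof(void *));
}
```

In this model an object that exactly fills a 64-byte cache ends up in a
128-byte slot once the pointer is appended, while an object with enough
slack at the end (e.g. 40 bytes in a 64-byte slot) pays nothing extra.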





* Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
  2020-05-07 22:26     ` Roman Gushchin
@ 2020-05-12 22:56       ` Johannes Weiner
  2020-05-15 22:01         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-05-12 22:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Thu, May 07, 2020 at 03:26:31PM -0700, Roman Gushchin wrote:
> On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > >  }
> > >  
> > >  #ifdef CONFIG_MEMCG_KMEM
> > > +extern spinlock_t css_set_lock;
> > > +
> > > +static void obj_cgroup_release(struct percpu_ref *ref)
> > > +{
> > > +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > > +	struct mem_cgroup *memcg;
> > > +	unsigned int nr_bytes;
> > > +	unsigned int nr_pages;
> > > +	unsigned long flags;
> > > +
> > > +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > > +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > > +	nr_pages = nr_bytes >> PAGE_SHIFT;
> > 
> > What guarantees that we don't have a partial page in there at this
> > point? I guess any outstanding allocations would pin the objcg, so
> > when it's released all objects have been freed.
> 
> Right, this is exactly the reason why there can't be a partial page
> at this point.
> 
> > 
> > But if that's true, how can we have full pages remaining in there now?
> 
> Imagine the following sequence:
> 1) CPU0: objcg == stock->cached_objcg
> 2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged
> 3) CPU1: a process from another memcg is allocating something, stock if flushed,
>    objcg->nr_charged_bytes = PAGE_SIZE - 92
> 5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes
> 6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes
> 
> In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged
> in obj_cgroup_release().
> 
> I've double checked this, it's actually pretty easy to trigger in the real life.

Ah, so no outstanding allocations, but a full page split between the
percpu cache and objcg->nr_charged_bytes.

Would it simplify things if refill_obj_stock() drained on >= PAGE_SIZE
stock instead of > PAGE_SIZE?

Otherwise, the scenario above would be good to have as a comment as
the drain on release is not self-explanatory.

> > > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > > +{
> > > +	struct mem_cgroup *memcg;
> > > +	unsigned int nr_pages, nr_bytes;
> > > +	int ret;
> > > +
> > > +	if (consume_obj_stock(objcg, size))
> > > +		return 0;
> > > +
> > > +	rcu_read_lock();
> > > +	memcg = obj_cgroup_memcg(objcg);
> > > +	css_get(&memcg->css);
> > > +	rcu_read_unlock();
> > > +
> > > +	nr_pages = size >> PAGE_SHIFT;
> > > +	nr_bytes = size & (PAGE_SIZE - 1);
> > > +
> > > +	if (nr_bytes)
> > > +		nr_pages += 1;
> > > +
> > > +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > 
> > If consume_obj_stock() fails because some other memcg is cached,
> > should this try to consume the partial page in objcg->nr_charged_bytes
> > before getting more pages?
> 
> We can definitely do it, but I'm not sure if it's good for the performance.
> 
> Dealing with nr_charged_bytes will require up to two atomic writes,
> so calling __memcg_kmem_charge() can be faster if memcg is cached
> on percpu stock.

Hm, but it's the slowpath. And sooner or later somebody has to deal
with the remaining memory in there.

> > > +	if (!ret && nr_bytes)
> > > +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> > 
> > This will put the cgroup into the cache if the allocation resulted in
> > a partially free page.
> > 
> > But if this was a page allocation, we may have objcg->nr_cache_bytes
> > from a previous subpage allocation that we should probably put back
> > into the stock.
> 
> Yeah, we can do this, but I don't know if there will be any benefits.

It's mostly about understanding the code.

> Actually we don't wanna to touch objcg->nr_cache_bytes too often, as
> it can become a contention point if there are many threads allocating
> in the memory cgroup.
> 
> So maybe we want to do the opposite: relax it a bit and stop flushing
> it on every stock refill and flush only if it exceeds a certain value.

That could be useful, yes.

> > It's not a common case, I'm just trying to understand what
> > objcg->nr_cache_bytes holds and when it does so.
> 
> So it's actually a centralized leftover from the rounding of the actual
> charge to the page size.

It would be good to add code comments explaining this.

Thanks!



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-08 21:35                       ` Christopher Lameter
@ 2020-05-13  0:57                         ` Roman Gushchin
  2020-05-15 21:45                           ` Christopher Lameter
  2020-05-20  9:51                           ` Vlastimil Babka
  2020-05-15 20:02                         ` Roman Gushchin
  1 sibling, 2 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-13  0:57 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, May 08, 2020 at 09:35:54PM +0000, Christoph Lameter wrote:
> On Mon, 4 May 2020, Roman Gushchin wrote:
> 
> > On Sat, May 02, 2020 at 11:54:09PM +0000, Christoph Lameter wrote:
> > > On Thu, 30 Apr 2020, Roman Gushchin wrote:
> > >
> > > > Sorry, but what exactly do you mean?
> > >
> > > I think the right approach is to add a pointer to each slab object for
> > > memcg support.
> > >
> >
> > As I understand, embedding the memcg pointer will hopefully make allocations
> > cheaper in terms of CPU, but will require more memory. And you think that
> > it's worth it. Is it a correct understanding?
> 
> It definitely makes the code less complex. The additional memory is
> minimal. In many cases you have already some space wasted at the end of
> the object that could be used for the pointer.
> 
> > Can you, please, describe a bit more detailed how it should be done
> > from your point of view?
> 
> Add it to the metadata at the end of the object. Like the debugging
> information or the pointer for RCU freeing.

Enabling debugging metadata currently disables cache merging.
Is it acceptable to sacrifice cache merging in order to embed
the memcg pointer? I doubt it.

> 
> > I mean where to store the pointer, should it be SLAB/SLUB-specific code
> > or a generic code, what do to with kmallocs alignments, should we
> > merge slabs which had a different size before and now have the same
> > because of the memcg pointer and aligment, etc.
> 
> Both SLAB and SLUB have the same capabilities there. Slabs that had
> different sizes before will now have different sizes as well. So the
> merging does not change.

See above. Or should I add it to the object itself before the metadata?

> 
> > I'm happy to follow your advice and perform some tests to get an idea of
> > how significant the memory overhead is and how big are CPU savings.
> > I guess with these numbers it will be easy to make a decision.
> 
> Sure. The main issue are the power of two kmalloc caches and how to add
> the pointer to these caches in order not to waste memory. SLAB has done
> this in the past by creating additional structues in a page frame.

But isn't it then similar to what I'm doing now?

Btw, I'm trying to build up a prototype with an embedded memcg pointer,
but it seems to be way more tricky than I thought. It requires changes to
shrinkers (as they rely on getting the memcg pointer by an arbitrary
kernel address, not necessarily aligned to the head of slab allocation),
figuring out cache merging, adding SLAB support, natural alignment of
kmallocs etc.

Figuring out all these details will likely take several weeks, so the whole
thing will be delayed for one or two major releases (in the best case). Given that
the current implementation saves ~40% of slab memory, I think there is some value
in delivering it as it is. So I wonder if the idea of embedding the pointer
should be considered a blocker, or if it can be implemented on top of the proposed
code (given it's not a user-facing API or something like this)?

Thanks!



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-08 21:35                       ` Christopher Lameter
  2020-05-13  0:57                         ` Roman Gushchin
@ 2020-05-15 20:02                         ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-15 20:02 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, May 08, 2020 at 09:35:54PM +0000, Christoph Lameter wrote:
> On Mon, 4 May 2020, Roman Gushchin wrote:
> 
> > On Sat, May 02, 2020 at 11:54:09PM +0000, Christoph Lameter wrote:
> > > On Thu, 30 Apr 2020, Roman Gushchin wrote:
> > >
> > > > Sorry, but what exactly do you mean?
> > >
> > > I think the right approach is to add a pointer to each slab object for
> > > memcg support.
> > >
> >
> > As I understand, embedding the memcg pointer will hopefully make allocations
> > cheaper in terms of CPU, but will require more memory. And you think that
> > it's worth it. Is it a correct understanding?
> 
> It definitely makes the code less complex. The additional memory is
> minimal. In many cases you have already some space wasted at the end of
> the object that could be used for the pointer.
> 
> > Can you, please, describe a bit more detailed how it should be done
> > from your point of view?
> 
> Add it to the metadata at the end of the object. Like the debugging
> information or the pointer for RCU freeing.

I've tried to make a prototype, but realized that I don't know how to do
it the right way with SLUB (without disabling cache merging, etc.)
and ended up debugging various memory corruptions.

memcg/kmem changes required to switch between different ways of storing
the memcg pointer are pretty minimal (diff below).

There are two functions which SLAB/SLUB should provide:

void set_obj_cgroup(struct kmem_cache *s, void *ptr, struct obj_cgroup *objcg);
struct obj_cgroup *obtain_obj_cgroup(struct kmem_cache *s, void *ptr);

Ideally, obtain_obj_cgroup should work with an arbitrary kernel pointer, e.g.
a pointer to some field in the structure allocated using kmem_cache_alloc().
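
To illustrate what "arbitrary kernel pointer" means here, a userspace
sketch of resolving an interior pointer to its owning object with a
per-slab side vector. slab_model, obj_index(), model_set_owner() and
model_get_owner() are hypothetical names for this example only; they are
not the set_obj_cgroup()/obtain_obj_cgroup() implementation, which is
still a TODO in the diff below.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one slab with fixed-size objects. */
struct slab_model {
	char *base;		/* address of the first object */
	size_t obj_size;	/* distance between consecutive objects */
	size_t nr_objs;		/* number of objects in the slab */
	void **owners;		/* one owner slot per object */
};

/* Map any address inside an object to that object's index, or -1. */
static long obj_index(const struct slab_model *s, const void *ptr)
{
	uintptr_t p = (uintptr_t)ptr;
	uintptr_t b = (uintptr_t)s->base;
	size_t idx;

	if (p < b)
		return -1;
	idx = (size_t)(p - b) / s->obj_size;
	if (idx >= s->nr_objs)
		return -1;
	return (long)idx;
}

static void model_set_owner(struct slab_model *s, void *ptr, void *owner)
{
	long idx = obj_index(s, ptr);

	if (idx >= 0)
		s->owners[idx] = owner;
}

static void *model_get_owner(const struct slab_model *s, const void *ptr)
{
	long idx = obj_index(s, ptr);

	return idx >= 0 ? s->owners[idx] : NULL;
}
```

With a side vector the lookup only needs the slab base and object size,
so a pointer to a field in the middle of an object resolves the same way
as a pointer to its head. With an embedded pointer the same lookup would
first have to find the start of the object.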

Christopher, will you be able to help with the SLUB implementation?
It will be highly appreciated.

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4af95739ccb6..398a714874d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2815,15 +2815,11 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 	/*
 	 * Slab objects are accounted individually, not per-page.
-	 * Memcg membership data for each individual object is saved in
-	 * the page->obj_cgroups.
 	 */
-	if (page_has_obj_cgroups(page)) {
+	if (PageSlab(page)) {
 		struct obj_cgroup *objcg;
-		unsigned int off;
 
-		off = obj_to_index(page->slab_cache, page, p);
-		objcg = page_obj_cgroups(page)[off];
+		objcg = obtain_obj_cgroup(page->slab_cache, p);
 		if (objcg)
 			return obj_cgroup_memcg(objcg);
 
diff --git a/mm/slab.h b/mm/slab.h
index 13fadf33be5c..617ce017bc68 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -210,40 +210,15 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+static inline void set_obj_cgroup(struct kmem_cache *s, void *ptr,
+				  struct obj_cgroup *objcg)
 {
-	/*
-	 * page->mem_cgroup and page->obj_cgroups are sharing the same
-	 * space. To distinguish between them in case we don't know for sure
-	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
-	 * always set the lowest bit of obj_cgroups.
-	 */
-	return (struct obj_cgroup **)
-		((unsigned long)page->obj_cgroups & ~0x1UL);
-}
-
-static inline bool page_has_obj_cgroups(struct page *page)
-{
-	return ((unsigned long)page->obj_cgroups & 0x1UL);
-}
-
-static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
-					       unsigned int objects)
-{
-	void *vec;
-
-	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
-	if (!vec)
-		return -ENOMEM;
-
-	page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
-	return 0;
+	// TODO
 }
-
-static inline void memcg_free_page_obj_cgroups(struct page *page)
+static inline struct obj_cgroup *obtain_obj_cgroup(struct kmem_cache *s, void *ptr)
 {
-	kfree(page_obj_cgroups(page));
-	page->obj_cgroups = NULL;
+	// TODO
+	return NULL;
 }
 
 static inline size_t obj_full_size(struct kmem_cache *s)
@@ -296,7 +271,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      void **p)
 {
 	struct page *page;
-	unsigned long off;
 	size_t i;
 
 	if (!objcg)
@@ -306,17 +280,8 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
-
-			if (!page_has_obj_cgroups(page) &&
-			    memcg_alloc_page_obj_cgroups(page, flags,
-							 objs_per_slab(s))) {
-				obj_cgroup_uncharge(objcg, obj_full_size(s));
-				continue;
-			}
-
-			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
-			page_obj_cgroups(page)[off] = objcg;
+			set_obj_cgroup(s, p[i], objcg);
 			mod_objcg_state(objcg, page_pgdat(page),
 					cache_vmstat_idx(s), obj_full_size(s));
 		} else {
@@ -330,21 +295,17 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 					void *p)
 {
 	struct obj_cgroup *objcg;
-	unsigned int off;
 
 	if (!memcg_kmem_enabled())
 		return;
 
-	if (!page_has_obj_cgroups(page))
-		return;
-
-	off = obj_to_index(s, page, p);
-	objcg = page_obj_cgroups(page)[off];
-	page_obj_cgroups(page)[off] = NULL;
+	objcg = obtain_obj_cgroup(s, p);
 
 	if (!objcg)
 		return;
 
+	set_obj_cgroup(s, p, NULL);
+
 	obj_cgroup_uncharge(objcg, obj_full_size(s));
 	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
 			-obj_full_size(s));
@@ -363,16 +324,6 @@ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 	return NULL;
 }
 
-static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
-					       unsigned int objects)
-{
-	return 0;
-}
-
-static inline void memcg_free_page_obj_cgroups(struct page *page)
-{
-}
-
 static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 							   size_t objects,
 							   gfp_t flags)
@@ -415,8 +366,6 @@ static __always_inline void charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	memcg_free_page_obj_cgroups(page);
-
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-13  0:57                         ` Roman Gushchin
@ 2020-05-15 21:45                           ` Christopher Lameter
  2020-05-15 22:12                             ` Roman Gushchin
  2020-05-20  9:51                           ` Vlastimil Babka
  1 sibling, 1 reply; 84+ messages in thread
From: Christopher Lameter @ 2020-05-15 21:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Tue, 12 May 2020, Roman Gushchin wrote:

> > Add it to the metadata at the end of the object. Like the debugging
> > information or the pointer for RCU freeing.
>
> Enabling debugging metadata currently disables the cache merging.
> I doubt that it's acceptable to sacrifice the cache merging in order
> to embed the memcg pointer?

Well then keep the merging even if you have a memcg pointer.

The disabling for debugging is only to simplify debugging. You don't have
to deal with multiple caches actually using the same storage structures.

> Figuring out all these details will likely take several weeks, so the whole
> thing will be delayed for one-two major releases (in the best case). Given that
> the current implementation saves ~40% of slab memory, I think there is some value
> in delivering it as it is. So I wonder if the idea of embedding the pointer
> should be considered a blocker, or it can be implemented of top of the proposed
> code (given it's not a user-facing api or something like this)?

Sorry no idea from my end here.




* Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
  2020-05-12 22:56       ` Johannes Weiner
@ 2020-05-15 22:01         ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-15 22:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, kernel-team, linux-kernel

On Tue, May 12, 2020 at 06:56:45PM -0400, Johannes Weiner wrote:
> On Thu, May 07, 2020 at 03:26:31PM -0700, Roman Gushchin wrote:
> > On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > > >  }
> > > >  
> > > >  #ifdef CONFIG_MEMCG_KMEM
> > > > +extern spinlock_t css_set_lock;
> > > > +
> > > > +static void obj_cgroup_release(struct percpu_ref *ref)
> > > > +{
> > > > +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > > > +	struct mem_cgroup *memcg;
> > > > +	unsigned int nr_bytes;
> > > > +	unsigned int nr_pages;
> > > > +	unsigned long flags;
> > > > +
> > > > +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > > > +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > > > +	nr_pages = nr_bytes >> PAGE_SHIFT;
> > > 
> > > What guarantees that we don't have a partial page in there at this
> > > point? I guess any outstanding allocations would pin the objcg, so
> > > when it's released all objects have been freed.
> > 
> > Right, this is exactly the reason why there can't be a partial page
> > at this point.
> > 
> > > 
> > > But if that's true, how can we have full pages remaining in there now?
> > 
> > Imagine the following sequence:
> > 1) CPU0: objcg == stock->cached_objcg
> > 2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged
> > 3) CPU1: a process from another memcg is allocating something, stock if flushed,
> >    objcg->nr_charged_bytes = PAGE_SIZE - 92
> > 5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes
> > 6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes
> > 
> > In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged
> > in obj_cgroup_release().
> > 
> > I've double checked this, it's actually pretty easy to trigger in the real life.
> 
> Ah, so no outstanding allocations, but a full page split between the
> percpu cache and objcg->nr_charged_bytes.
> 
> Would it simplify things if refill_obj_stock() drained on >= PAGE_SIZE
> stock instead of > PAGE_SIZE?
> 
> Otherwise, the scenario above would be good to have as a comment as
> the drain on release is not self-explanatory.
> 
> > > > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > > > +{
> > > > +	struct mem_cgroup *memcg;
> > > > +	unsigned int nr_pages, nr_bytes;
> > > > +	int ret;
> > > > +
> > > > +	if (consume_obj_stock(objcg, size))
> > > > +		return 0;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	memcg = obj_cgroup_memcg(objcg);
> > > > +	css_get(&memcg->css);
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	nr_pages = size >> PAGE_SHIFT;
> > > > +	nr_bytes = size & (PAGE_SIZE - 1);
> > > > +
> > > > +	if (nr_bytes)
> > > > +		nr_pages += 1;
> > > > +
> > > > +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > > 
> > > If consume_obj_stock() fails because some other memcg is cached,
> > > should this try to consume the partial page in objcg->nr_charged_bytes
> > > before getting more pages?
> > 
> > We can definitely do it, but I'm not sure if it's good for the performance.
> > 
> > Dealing with nr_charged_bytes will require up to two atomic writes,
> > so calling __memcg_kmem_charge() can be faster if memcg is cached
> > on percpu stock.
> 
> Hm, but it's the slowpath. And sooner or later somebody has to deal
> with the remaining memory in there.
> 
> > > > +	if (!ret && nr_bytes)
> > > > +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> > > 
> > > This will put the cgroup into the cache if the allocation resulted in
> > > a partially free page.
> > > 
> > > But if this was a page allocation, we may have objcg->nr_cache_bytes
> > > from a previous subpage allocation that we should probably put back
> > > into the stock.
> > 
> > Yeah, we can do this, but I don't know if there will be any benefits.
> 
> It's mostly about understanding the code.
> 
> > Actually we don't wanna to touch objcg->nr_cache_bytes too often, as
> > it can become a contention point if there are many threads allocating
> > in the memory cgroup.
> > 
> > So maybe we want to do the opposite: relax it a bit and stop flushing
> > it on every stock refill and flush only if it exceeds a certain value.
> 
> That could be useful, yes.
> 
> > > It's not a common case, I'm just trying to understand what
> > > objcg->nr_cache_bytes holds and when it does so.
> > 
> > So it's actually a centralized leftover from the rounding of the actual
> > charge to the page size.
> 
> It would be good to add code comments explaining this.


Thank you for the comments! Please find the updated version with some added
explanations below. No functional changes, just added some comments here
and there. Thanks!

--

From a9eadd2b624c37ffda981d172f2bfb9ceae0f984 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 14 Jan 2020 08:59:24 -0800
Subject: [PATCH v3.1 06/20] mm: memcg/slab: obj_cgroup API

The obj_cgroup API provides the ability to account sub-page sized kernel
objects, which potentially outlive the original memory cgroup.

The top-level API consists of the following functions:
  bool obj_cgroup_tryget(struct obj_cgroup *objcg);
  void obj_cgroup_get(struct obj_cgroup *objcg);
  void obj_cgroup_put(struct obj_cgroup *objcg);

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

  struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
  struct obj_cgroup *get_obj_cgroup_from_current(void);

Object cgroup is basically a pointer to a memory cgroup with a per-cpu
reference counter. It substitutes a memory cgroup in places where
it's necessary to charge a custom amount of bytes instead of pages.

All charged memory rounded down to pages is charged to the
corresponding memory cgroup using __memcg_kmem_charge().

It implements reparenting: on memcg offlining the object cgroup gets
reattached to the parent memory cgroup. Each online memory cgroup has an
associated active object cgroup to handle new allocations and the list
of all attached object cgroups. On offlining of a cgroup this list is
reparented and for each object cgroup in the list the memcg pointer is
swapped to the parent memory cgroup. It prevents long-living objects
from pinning the original memory cgroup in memory.

The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is a part of
obj_cgroup object. So on cgroup offlining the leftover is automatically
reparented.

memcg->objcg is rcu protected.
objcg->memcg is a raw pointer, which is always pointing at a memory
cgroup, but can be atomically swapped to the parent memory cgroup. So
the caller must ensure the lifetime of the cgroup, e.g. grab
rcu_read_lock or css_set_lock.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  51 +++++++
 mm/memcontrol.c            | 278 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 328 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c2eb73d89f5d..bf1be842fd27 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
+ */
+struct obj_cgroup {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	atomic_t nr_charged_bytes;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -306,6 +323,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct obj_cgroup __rcu *objcg;
+	struct list_head objcg_list;
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -429,6 +448,33 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
+{
+	return percpu_ref_tryget(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	percpu_ref_get(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+	percpu_ref_put(&objcg->refcnt);
+}
+
+/*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1390,6 +1436,11 @@ void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_current(void);
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 83805b48817d..0423705d3068 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,98 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+	struct mem_cgroup *memcg;
+	unsigned int nr_bytes;
+	unsigned int nr_pages;
+	unsigned long flags;
+
+	/*
+	 * At this point all allocated objects are freed, and
+	 * objcg->nr_charged_bytes can't have an arbitrary byte value.
+	 * However, it can be PAGE_SIZE or (x * PAGE_SIZE).
+	 *
+	 * The following sequence can lead to it:
+	 * 1) CPU0: objcg == stock->cached_objcg
+	 * 2) CPU1: we do a small allocation (e.g. 92 bytes),
+	 *          PAGE_SIZE bytes are charged
+	 * 3) CPU1: a process from another memcg is allocating something,
+	 *          the stock is flushed,
+	 *          objcg->nr_charged_bytes = PAGE_SIZE - 92
+	 * 4) CPU0: we release this object,
+	 *          92 bytes are added to stock->nr_bytes
+	 * 5) CPU0: the stock is flushed,
+	 *          92 bytes are added to objcg->nr_charged_bytes
+	 *
+	 * As a result, nr_charged_bytes == PAGE_SIZE.
+	 * This page will be uncharged in obj_cgroup_release().
+	 */
+	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
+	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	memcg = obj_cgroup_memcg(objcg);
+	if (nr_pages)
+		__memcg_kmem_uncharge(memcg, nr_pages);
+	list_del(&objcg->list);
+	mem_cgroup_put(memcg);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	percpu_ref_exit(ref);
+	kfree_rcu(objcg, rcu);
+}
+
+static struct obj_cgroup *obj_cgroup_alloc(void)
+{
+	struct obj_cgroup *objcg;
+	int ret;
+
+	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
+		return NULL;
+
+	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
+			      GFP_KERNEL);
+	if (ret) {
+		kfree(objcg);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&objcg->list);
+	return objcg;
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct obj_cgroup *objcg, *iter;
+
+	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+
+	spin_lock_irq(&css_set_lock);
+
+	/* Move active objcg to the parent's list */
+	xchg(&objcg->memcg, parent);
+	css_get(&parent->css);
+	list_add(&objcg->list, &parent->objcg_list);
+
+	/* Move already reparented objcgs to the parent's list */
+	list_for_each_entry(iter, &memcg->objcg_list, list) {
+		css_get(&parent->css);
+		xchg(&iter->memcg, parent);
+		css_put(&memcg->css);
+	}
+	list_splice(&memcg->objcg_list, &parent->objcg_list);
+
+	spin_unlock_irq(&css_set_lock);
+
+	percpu_ref_kill(&objcg->refcnt);
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -2064,6 +2156,12 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	unsigned int nr_bytes;
+#endif
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2071,6 +2169,22 @@ struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg);
+
+#else
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+}
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	return false;
+}
+#endif
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2137,6 +2251,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2196,6 +2311,8 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		if (obj_stock_flush_required(stock, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2723,6 +2840,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	return page->mem_cgroup;
 }
 
+__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (unlikely(!current->mm))
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(current->active_memcg))
+		memcg = rcu_dereference(current->active_memcg);
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -3007,6 +3148,130 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 }
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->cached_objcg;
+
+	if (!old)
+		return;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		if (nr_pages) {
+			rcu_read_lock();
+			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			rcu_read_unlock();
+		}
+
+		/*
+		 * The leftover is flushed to the centralized per-memcg value.
+		 * On the next attempt to refill obj stock it will be moved
+		 * to a per-cpu stock (probably, on another CPU), see
+		 * refill_obj_stock().
+		 *
+		 * How often it's flushed is a trade-off between the memory
+		 * limit enforcement accuracy and potential CPU contention,
+		 * so it might be changed in the future.
+		 */
+		atomic_add(nr_bytes, &old->nr_charged_bytes);
+		stock->nr_bytes = 0;
+	}
+
+	obj_cgroup_put(old);
+	stock->cached_objcg = NULL;
+}
+
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+
+	return false;
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached_objcg != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		obj_cgroup_get(objcg);
+		stock->cached_objcg = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > PAGE_SIZE)
+		drain_obj_stock(stock);
+
+	local_irq_restore(flags);
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	unsigned int nr_pages, nr_bytes;
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	css_get(&memcg->css);
+	rcu_read_unlock();
+
+	nr_pages = size >> PAGE_SHIFT;
+	nr_bytes = size & (PAGE_SIZE - 1);
+
+	if (nr_bytes)
+		nr_pages += 1;
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	if (!ret && nr_bytes)
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+
+	css_put(&memcg->css);
+	return ret;
+}
+
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
+{
+	refill_obj_stock(objcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -3429,6 +3694,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
+	struct obj_cgroup *objcg;
 	int memcg_id;
 
 	if (cgroup_memory_nokmem)
@@ -3441,6 +3707,14 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg) {
+		memcg_free_cache_id(memcg_id);
+		return -ENOMEM;
+	}
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3476,9 +3750,10 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches.
+	 * Deactivate and reparent kmem_caches and objcgs.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5045,6 +5320,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->objcg_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.25.4
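
[Editor's illustration, not part of the patch] The byte-level charging
path above (consume_obj_stock(), refill_obj_stock(), drain_obj_stock(),
obj_cgroup_charge()) can be modelled in userspace. The sketch below is a
simplified model under stated assumptions: all model_* names are
hypothetical, each "CPU" is just a struct, PAGE_SIZE is fixed at 4096,
and refcounting, IRQ disabling, RCU and charge failures are left out.
It replays the sequence from the obj_cgroup_release() comment to show
how nr_charged_bytes can end up equal to PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Hypothetical stand-ins for mem_cgroup, obj_cgroup and memcg_stock_pcp. */
struct model_memcg { unsigned long charged_pages; };
struct model_objcg { struct model_memcg *memcg; unsigned int nr_charged_bytes; };
struct model_stock { struct model_objcg *cached_objcg; unsigned int nr_bytes; };

/* Take nr_bytes from the local stock if it caches this objcg. */
static int model_consume(struct model_stock *stock, struct model_objcg *objcg,
			 unsigned int nr_bytes)
{
	if (stock->cached_objcg == objcg && stock->nr_bytes >= nr_bytes) {
		stock->nr_bytes -= nr_bytes;
		return 1;
	}
	return 0;
}

/* Uncharge whole pages; flush the sub-page leftover to the objcg. */
static void model_drain(struct model_stock *stock)
{
	struct model_objcg *old = stock->cached_objcg;

	if (!old)
		return;
	old->memcg->charged_pages -= stock->nr_bytes >> PAGE_SHIFT;
	old->nr_charged_bytes += stock->nr_bytes & (PAGE_SIZE - 1);
	stock->nr_bytes = 0;
	stock->cached_objcg = NULL;
}

/* Return nr_bytes to the local stock, switching the cached objcg if needed. */
static void model_refill(struct model_stock *stock, struct model_objcg *objcg,
			 unsigned int nr_bytes)
{
	if (stock->cached_objcg != objcg) {
		model_drain(stock);
		stock->cached_objcg = objcg;
		stock->nr_bytes = objcg->nr_charged_bytes;
		objcg->nr_charged_bytes = 0;
	}
	stock->nr_bytes += nr_bytes;
	if (stock->nr_bytes > PAGE_SIZE)
		model_drain(stock);
}

/* Charge size bytes: round up to pages, stock the unused remainder. */
static void model_charge(struct model_stock *stock, struct model_objcg *objcg,
			 unsigned int size)
{
	unsigned int nr_pages = size >> PAGE_SHIFT;
	unsigned int nr_bytes = size & (PAGE_SIZE - 1);

	if (model_consume(stock, objcg, size))
		return;
	if (nr_bytes)
		nr_pages += 1;
	objcg->memcg->charged_pages += nr_pages;
	if (nr_bytes)
		model_refill(stock, objcg, PAGE_SIZE - nr_bytes);
}

static void model_uncharge(struct model_stock *stock, struct model_objcg *objcg,
			   unsigned int size)
{
	model_refill(stock, objcg, size);
}

/* Replay the sequence from the comment; returns a's final nr_charged_bytes. */
static unsigned int model_replay_sequence(void)
{
	struct model_memcg m1 = { 0 }, m2 = { 0 };
	struct model_objcg a = { &m1, 0 }, b = { &m2, 0 };
	struct model_stock cpu0 = { &a, 0 };	/* CPU0 already caches objcg a */
	struct model_stock cpu1 = { NULL, 0 };

	model_charge(&cpu1, &a, 92);	/* one page charged, rest stocked on CPU1 */
	model_charge(&cpu1, &b, 100);	/* other memcg allocates, CPU1 stock flushed */
	assert(a.nr_charged_bytes == PAGE_SIZE - 92);
	model_uncharge(&cpu0, &a, 92);	/* the object is freed on CPU0 */
	model_drain(&cpu0);		/* CPU0 stock flushed */
	assert(m1.charged_pages == 1);	/* the page itself is still charged */
	return a.nr_charged_bytes;
}
```

Calling model_replay_sequence() from a main() shows that the leftover
is a whole page, which is the case obj_cgroup_release() handles by
uncharging nr_bytes >> PAGE_SHIFT pages.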




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-15 21:45                           ` Christopher Lameter
@ 2020-05-15 22:12                             ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-15 22:12 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, May 15, 2020 at 09:45:30PM +0000, Christoph Lameter wrote:
> On Tue, 12 May 2020, Roman Gushchin wrote:
> 
> > > Add it to the metadata at the end of the object. Like the debugging
> > > information or the pointer for RCU freeing.
> >
> > Enabling debugging metadata currently disables the cache merging.
> > I doubt that it's acceptable to sacrifice the cache merging in order
> > to embed the memcg pointer?
> 
> Well then keep the merging even if you have a memcg pointer.
> 
> The disabling for debugging is only to simplify debugging. You dont have
> to deal with multiple caches actually using the same storage structures.
> 
> > Figuring out all these details will likely take several weeks, so the whole
> > thing will be delayed for one-two major releases (in the best case). Given that
> > the current implementation saves ~40% of slab memory, I think there is some value
> > in delivering it as it is. So I wonder if the idea of embedding the pointer
> > should be considered a blocker, or it can be implemented on top of the proposed
> > code (given it's not a user-facing api or something like this)?
> 
> Sorry no idea from my end here.

Ok, then I'll continue working on embedding the pointer as an enhancement
*on top* of the current patchset. As I showed in my other e-mail, switching
to a different way of obj_cgroup storage is fairly trivial and doesn't change
much in the rest of the patchset.

Please, let me know if you're not ok with it.

Thanks!



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-13  0:57                         ` Roman Gushchin
  2020-05-15 21:45                           ` Christopher Lameter
@ 2020-05-20  9:51                           ` Vlastimil Babka
  2020-05-20 20:57                             ` Roman Gushchin
  1 sibling, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20  9:51 UTC (permalink / raw)
  To: Roman Gushchin, Christopher Lameter
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On 5/13/20 2:57 AM, Roman Gushchin wrote:
> 
> Btw, I'm trying to build up a prototype with an embedded memcg pointer,
> but it seems to be way more tricky than I thought. It requires changes to
> shrinkers (as they rely on getting the memcg pointer by an arbitrary
> kernel address, not necessarily aligned to the head of slab allocation),
> figuring out cache merging, adding SLAB support, natural alignment of
> kmallocs etc.

Is the natural alignment of kmallocs a problem right now? As kmalloc()
allocations are AFAIK not kmemcg-accounted? Or does your implementation add
memcg awareness to everything, even if non-__GFP_ACCOUNT allocations just get a
root memcg pointer?

> Figuring out all these details will likely take several weeks, so the whole
> thing will be delayed for one-two major releases (in the best case). Given that
> the current implementation saves ~40% of slab memory, I think there is some value
> in delivering it as it is. So I wonder if the idea of embedding the pointer
> should be considered a blocker, or it can be implemented on top of the proposed
> code (given it's not a user-facing api or something like this)?
> 
> Thanks!
> 




* Re: [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
  2020-05-07 20:33   ` Johannes Weiner
@ 2020-05-20 10:49   ` Vlastimil Babka
  1 sibling, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20 10:49 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> To convert memcg and lruvec slab counters to bytes there must be
> a way to change these counters without touching node counters.
> Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -713,30 +713,14 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
>  	return mem_cgroup_nodeinfo(parent, nid);
>  }
>  
> -/**
> - * __mod_lruvec_state - update lruvec memory statistics
> - * @lruvec: the lruvec
> - * @idx: the stat item
> - * @val: delta to add to the counter, can be negative
> - *
> - * The lruvec is the intersection of the NUMA node and a cgroup. This
> - * function updates the all three counters that are affected by a
> - * change of state at this level: per-node, per-cgroup, per-lruvec.
> - */
> -void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> -			int val)
> +void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> +			      int val)
>  {
>  	pg_data_t *pgdat = lruvec_pgdat(lruvec);

Looks like the pgdat can now be moved into the MEMCG_CHARGE_BATCH if().




* Re: [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-04-22 20:46 ` [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
  2020-05-07 20:34   ` Johannes Weiner
@ 2020-05-20 11:31   ` Vlastimil Babka
  2020-05-20 11:36     ` Vlastimil Babka
  1 sibling, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20 11:31 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> To implement per-object slab memory accounting, we need to
> convert slab vmstat counters to bytes. Actually, out of
> 4 levels of counters: global, per-node, per-memcg and per-lruvec
> only two last levels will require byte-sized counters.
> It's because global and per-node counters will be counting the
> number of slab pages, and per-memcg and per-lruvec will be
> counting the amount of memory taken by charged slab objects.
> 
> Converting all vmstat counters to bytes or even all slab
> counters to bytes would introduce an additional overhead.
> So instead let's store global and per-node counters
> in pages, and memcg and lruvec counters in bytes.
> 
> To make the API clean all access helpers (both on the read
> and write sides) are dealing with bytes.
> 
> To avoid back-and-forth conversions a new flavor of helpers
> is introduced, which always returns values in pages:
> node_page_state_pages() and global_node_page_state_pages().
> 
> Actually new helpers are just reading raw values. Old helpers are
> simple wrappers, which perform a conversion if the vmstat items are
> in bytes. Because at the moment no one actually needs bytes,
> there are WARN_ON_ONCE() macros inside to warn about inappropriate
> use cases.
> 
> Thanks to Johannes Weiner for the idea of having the byte-sized API
> on top of the page-sized internal storage.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-By: Vlastimil Babka <vbabka@suse.cz>

But it's somewhat complicated, so it would be great to document in comments,
e.g. in include/linux/vmstat.h, that what the API returns as unsigned long can
be either bytes or pages depending on vmstat_item_in_bytes().

> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -204,6 +204,11 @@ enum node_stat_item {
>  	NR_VM_NODE_STAT_ITEMS
>  };
>  
> +static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)

This should also have a comment explaining if it's talking about API or storage,
as it's not immediately obvious.

> +{
> +	return false;
> +}
> +
>  /*
>   * We do arithmetic on the LRU lists in various places in the code,
>   * so it is important to keep the active lists LRU_ACTIVE higher in






* Re: [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-05-20 11:31   ` Vlastimil Babka
@ 2020-05-20 11:36     ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20 11:36 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 5/20/20 1:31 PM, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
>> To implement per-object slab memory accounting, we need to
>> convert slab vmstat counters to bytes. Actually, out of
>> 4 levels of counters: global, per-node, per-memcg and per-lruvec
>> only two last levels will require byte-sized counters.
>> It's because global and per-node counters will be counting the
>> number of slab pages, and per-memcg and per-lruvec will be
>> counting the amount of memory taken by charged slab objects.
>> 
>> Converting all vmstat counters to bytes or even all slab
>> counters to bytes would introduce an additional overhead.
>> So instead let's store global and per-node counters
>> in pages, and memcg and lruvec counters in bytes.
>> 
>> To make the API clean all access helpers (both on the read
>> and write sides) are dealing with bytes.
>> 
>> To avoid back-and-forth conversions a new flavor of helpers
>> is introduced, which always returns values in pages:
>> node_page_state_pages() and global_node_page_state_pages().
>> 
>> Actually new helpers are just reading raw values. Old helpers are
>> simple wrappers, which perform a conversion if the vmstat items are
>> in bytes. Because at the moment no one actually needs bytes,
>> there are WARN_ON_ONCE() macros inside to warn about inappropriate
>> use cases.
>> 
>> Thanks to Johannes Weiner for the idea of having the byte-sized API
>> on top of the page-sized internal storage.
>> 
>> Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-By: Vlastimil Babka <vbabka@suse.cz>
> 
> But it's somewhat complicated, so it would be great to document in comments,
> e.g. in include/linux/vmstat.h, that what the API returns as unsigned long can
> be either bytes or pages depending on vmstat_item_in_bytes().

Also forgot to add that if those WARN_ON_ONCEs are going to stay, they should
rather become VM_WARN_ON_ONCEs
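
[Editor's illustration, not part of the thread] The scheme under review
(node-level storage in pages, a byte-based API on top, and a raw
"_pages" read helper) can be sketched in userspace as below. This is a
minimal model based only on the commit message quoted above: every
model_* name is hypothetical, PAGE_SIZE is assumed to be 4096, and the
real helpers in include/linux/vmstat.h may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096L

enum model_stat_item {
	MODEL_NR_FILE_PAGES,		/* plain page counter */
	MODEL_NR_SLAB_RECLAIMABLE_B,	/* byte-based API, stored in pages */
	MODEL_NR_SLAB_UNRECLAIMABLE_B,	/* byte-based API, stored in pages */
	MODEL_NR_ITEMS
};

/* Node-level storage: every item is kept in pages. */
static long model_node_stat[MODEL_NR_ITEMS];

/* Tells whether the API (not the storage) of an item is in bytes. */
static bool model_item_in_bytes(enum model_stat_item item)
{
	return item == MODEL_NR_SLAB_RECLAIMABLE_B ||
	       item == MODEL_NR_SLAB_UNRECLAIMABLE_B;
}

/* Write side: takes bytes for byte items, pages for everything else. */
static void model_mod_node_state(enum model_stat_item item, long delta)
{
	if (model_item_in_bytes(item)) {
		/* Node-level changes are expected to be whole pages. */
		assert(delta % MODEL_PAGE_SIZE == 0);
		delta /= MODEL_PAGE_SIZE;
	}
	model_node_stat[item] += delta;
}

/* Raw read: always pages, no conversion and no branch on the item. */
static long model_node_state_pages(enum model_stat_item item)
{
	return model_node_stat[item];
}

/* Converting read: bytes for byte items, pages for everything else. */
static long model_node_state(enum model_stat_item item)
{
	long pages = model_node_state_pages(item);

	return model_item_in_bytes(item) ? pages * MODEL_PAGE_SIZE : pages;
}
```

The unit ambiguity raised in the review is visible here:
model_node_state() returns the same type for both kinds of item, so
only model_item_in_bytes() tells the caller whether the value is bytes
or pages.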



* Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-04-22 20:46 ` [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
  2020-05-07 20:41   ` Johannes Weiner
@ 2020-05-20 12:25   ` Vlastimil Babka
  2020-05-20 19:26     ` Roman Gushchin
  1 sibling, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20 12:25 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> In order to prepare for per-object slab memory accounting, convert
> NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
> 
> To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
> NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
> 
> Internally global and per-node counters are stored in pages,
> however memcg and lruvec counters are stored in bytes.
> This scheme may look weird, but only for now. As soon as slab
> pages will be shared between multiple cgroups, global and
> node counters will reflect the total number of slab pages.
> However memcg and lruvec counters will be used for per-memcg
> slab memory tracking, which will take separate kernel objects
> into account. Keeping global and node counters in pages helps
> to avoid additional overhead.
> 
> The size of slab memory shouldn't exceed 4GB on 32-bit machines,
> so it will fit into atomic_long_t we use for vmstats.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  drivers/base/node.c     |  4 ++--
>  fs/proc/meminfo.c       |  4 ++--
>  include/linux/mmzone.h  | 16 +++++++++++++---
>  kernel/power/snapshot.c |  2 +-
>  mm/memcontrol.c         | 11 ++++-------
>  mm/oom_kill.c           |  2 +-
>  mm/page_alloc.c         |  8 ++++----
>  mm/slab.h               | 15 ++++++++-------
>  mm/slab_common.c        |  4 ++--
>  mm/slob.c               | 12 ++++++------
>  mm/slub.c               |  8 ++++----
>  mm/vmscan.c             |  3 ++-
>  mm/workingset.c         |  6 ++++--
>  13 files changed, 53 insertions(+), 42 deletions(-)


> @@ -206,7 +206,17 @@ enum node_stat_item {
>  
>  static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
>  {
> -	return false;
> +	/*
> +	 * Global and per-node slab counters track slab pages.
> +	 * It's expected that changes are multiples of PAGE_SIZE.
> +	 * Internally values are stored in pages.
> +	 *
> +	 * Per-memcg and per-lruvec counters track memory, consumed
> +	 * by individual slab objects. These counters are actually
> +	 * byte-precise.
> +	 */
> +	return (item == NR_SLAB_RECLAIMABLE_B ||
> +		item == NR_SLAB_UNRECLAIMABLE_B);
>  }

Ok, so this is no longer a no-op, but __always_inline here and inline in
global_node_page_state() should hopefully mean that for all users of
global_node_page_state(<constant>) the compiler will eliminate the branch for
non-slab counters. But there are also functions such as si_mem_available() that
use non-constant item. Maybe compiler is smart enough anyway, but perhaps it's
better to use global_node_page_state_pages() in such callers?

However __mod_node_page_state() and mod_node_page_state() will now branch always. I
wonder if the "API clean" goal is worth it...

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1409,9 +1409,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
>  		       1024);
>  	seq_buf_printf(&s, "slab %llu\n",
> -		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
> -			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
> -		       PAGE_SIZE);
> +		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> +			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
>  	seq_buf_printf(&s, "sock %llu\n",
>  		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
>  		       PAGE_SIZE);
> @@ -1445,11 +1444,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  			       PAGE_SIZE);
>  
>  	seq_buf_printf(&s, "slab_reclaimable %llu\n",
> -		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
> -		       PAGE_SIZE);
> +		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
>  	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
> -		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
> -		       PAGE_SIZE);
> +		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));

So here we are now printing in bytes instead of pages, right? That's fine for
OOM report, but in sysfs aren't we breaking existing users?




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-04-22 20:46 ` [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
  2020-04-22 23:52   ` Christopher Lameter
@ 2020-05-20 13:51   ` Vlastimil Babka
  2020-05-20 21:00     ` Roman Gushchin
  1 sibling, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-20 13:51 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Christoph Lameter

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> This commit implements SLUB version of the obj_to_index() function,
> which will be required to calculate the offset of obj_cgroup in the
> obj_cgroups vector to store/obtain the objcg ownership data.
> 
> To make it faster, let's repeat the SLAB's trick introduced by
> commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> divide in obj_to_index()") and avoid an expensive division.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

There's already a slab_index() doing the same without the trick, with only
SLUB_DEBUG callers. Maybe just improve it and perhaps rename? (obj_to_index()
seems more descriptive). The difference is that it takes the result of
page_addr() instead of doing that, as it's being called in a loop on objects
from a single page, so you'd have to perhaps split to obj_to_index(page) and
__obj_to_index(addr) or something.

> ---
>  include/linux/slub_def.h | 9 +++++++++
>  mm/slub.c                | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index d2153789bd9f..200ea292f250 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -8,6 +8,7 @@
>   * (C) 2007 SGI, Christoph Lameter
>   */
>  #include <linux/kobject.h>
> +#include <linux/reciprocal_div.h>
>  
>  enum stat_item {
>  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> @@ -86,6 +87,7 @@ struct kmem_cache {
>  	unsigned long min_partial;
>  	unsigned int size;	/* The size of an object including metadata */
>  	unsigned int object_size;/* The size of an object without metadata */
> +	struct reciprocal_value reciprocal_size;
>  	unsigned int offset;	/* Free pointer offset */
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  	/* Number of per cpu partial objects to keep around */
> @@ -182,4 +184,11 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
>  	return result;
>  }
>  
> +static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> +					const struct page *page, void *obj)
> +{
> +	return reciprocal_divide(kasan_reset_tag(obj) - page_address(page),
> +				 cache->reciprocal_size);
> +}
> +
>  #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/mm/slub.c b/mm/slub.c
> index 03071ae5ff07..8d16babe1829 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3660,6 +3660,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
>  	 */
>  	size = ALIGN(size, s->align);
>  	s->size = size;
> +	s->reciprocal_size = reciprocal_value(size);
>  	if (forced_order >= 0)
>  		order = forced_order;
>  	else
> 
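
[Editor's illustration, not part of the thread] The multiply-instead-of-
divide trick used by obj_to_index() can be tried out in userspace. The
sketch below re-implements the reciprocal-division scheme as the editor
understands it from the kernel's reciprocal_div helpers; the recip_* and
model_* names are hypothetical, and the code should be read as an
approximation for experimentation rather than the kernel's exact source.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed "magic" values for dividing by a fixed divisor. */
struct recip {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* 1-based position of the highest set bit; 0 for x == 0. */
static int model_fls(uint32_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Done once per cache, when the object size is calculated. */
static struct recip recip_value(uint32_t d)
{
	struct recip R;
	uint64_t m;
	int l;

	assert(d != 0);
	l = model_fls(d - 1);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	R.m = (uint32_t)m;
	R.sh1 = (uint8_t)(l < 1 ? l : 1);
	R.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);
	return R;
}

/* a / d using one multiplication and two shifts. */
static uint32_t recip_divide(uint32_t a, struct recip R)
{
	uint32_t t = (uint32_t)(((uint64_t)a * R.m) >> 32);

	return (t + ((a - t) >> R.sh1)) >> R.sh2;
}

/* Index of an object inside its slab page, given byte addresses. */
static uint32_t model_obj_to_index(uintptr_t page_base, uintptr_t obj,
				   struct recip size_recip)
{
	return recip_divide((uint32_t)(obj - page_base), size_recip);
}

/* Returns 1 if every offset in [0, max_offset] matches plain division. */
static int recip_matches_division(uint32_t size, uint32_t max_offset)
{
	struct recip R = recip_value(size);
	uint32_t off;

	for (off = 0; off <= max_offset; off++)
		if (recip_divide(off, R) != off / size)
			return 0;
	return 1;
}
```

The reciprocal is computed once per cache (in the patch, in
calculate_sizes()), so the per-object cost is a multiplication and two
shifts instead of a division.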




* Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-05-20 12:25   ` Vlastimil Babka
@ 2020-05-20 19:26     ` Roman Gushchin
  2020-05-21  9:57       ` Vlastimil Babka
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-20 19:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, May 20, 2020 at 02:25:22PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > In order to prepare for per-object slab memory accounting, convert
> > NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
> > 
> > To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
> > NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
> > 
> > Internally global and per-node counters are stored in pages,
> > however memcg and lruvec counters are stored in bytes.
> > This scheme may look weird, but only for now. As soon as slab
> > pages will be shared between multiple cgroups, global and
> > node counters will reflect the total number of slab pages.
> > However memcg and lruvec counters will be used for per-memcg
> > slab memory tracking, which will take separate kernel objects
> > into account. Keeping global and node counters in pages helps
> > to avoid additional overhead.
> > 
> > The size of slab memory shouldn't exceed 4GB on 32-bit machines,
> > so it will fit into atomic_long_t we use for vmstats.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  drivers/base/node.c     |  4 ++--
> >  fs/proc/meminfo.c       |  4 ++--
> >  include/linux/mmzone.h  | 16 +++++++++++++---
> >  kernel/power/snapshot.c |  2 +-
> >  mm/memcontrol.c         | 11 ++++-------
> >  mm/oom_kill.c           |  2 +-
> >  mm/page_alloc.c         |  8 ++++----
> >  mm/slab.h               | 15 ++++++++-------
> >  mm/slab_common.c        |  4 ++--
> >  mm/slob.c               | 12 ++++++------
> >  mm/slub.c               |  8 ++++----
> >  mm/vmscan.c             |  3 ++-
> >  mm/workingset.c         |  6 ++++--
> >  13 files changed, 53 insertions(+), 42 deletions(-)
> 
> 
> > @@ -206,7 +206,17 @@ enum node_stat_item {
> >  
> >  static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> >  {
> > -	return false;
> > +	/*
> > +	 * Global and per-node slab counters track slab pages.
> > +	 * It's expected that changes are multiples of PAGE_SIZE.
> > +	 * Internally values are stored in pages.
> > +	 *
> > +	 * Per-memcg and per-lruvec counters track memory, consumed
> > +	 * by individual slab objects. These counters are actually
> > +	 * byte-precise.
> > +	 */
> > +	return (item == NR_SLAB_RECLAIMABLE_B ||
> > +		item == NR_SLAB_UNRECLAIMABLE_B);

Hello, Vlastimil!

Thank you for looking into the patchset, appreciate it.
In the next version I'll add some comments based on your suggestions in previous
letters.

> >  }
> 
> Ok, so this is no longer a no-op, but __always_inline here and inline in
> global_node_page_state() should hopefully mean that for all users of
> global_node_page_state(<constant>) the compiler will eliminate the branch for
> non-slab counters. But there are also functions such as si_mem_available() that
> use non-constant item. Maybe compiler is smart enough anyway, but perhaps it's
> better to use global_node_page_state_pages() in such callers?

I'll take a look, thanks for the idea.

> 
> However __mod_node_page_state() and mod_node_page_state() will now branch always. I
> wonder if the "API clean" goal is worth it...

You mean just adding a special write-side helper which will perform a conversion
and put VM_WARN_ON_ONCE() into generic write-side helpers?

> 
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1409,9 +1409,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> >  		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
> >  		       1024);
> >  	seq_buf_printf(&s, "slab %llu\n",
> > -		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
> > -			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
> > -		       PAGE_SIZE);
> > +		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > +			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
> >  	seq_buf_printf(&s, "sock %llu\n",
> >  		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
> >  		       PAGE_SIZE);
> > @@ -1445,11 +1444,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> >  			       PAGE_SIZE);
> >  
> >  	seq_buf_printf(&s, "slab_reclaimable %llu\n",
> > -		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
> > -		       PAGE_SIZE);
> > +		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
> >  	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
> > -		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
> > -		       PAGE_SIZE);
> > +		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
> 
> So here we are now printing in bytes instead of pages, right? That's fine for
> OOM report, but in sysfs aren't we breaking existing users?
> 

Hm, but it was in bytes previously, look at that x * PAGE_SIZE.
Or do you mean that the value may now not be a multiple of PAGE_SIZE?
If so, I don't think it breaks any expectations.

Thanks!
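
To make the units point concrete, here is a minimal user-space sketch (not
kernel code). The sk_* names and the 4 KiB page size are assumptions made for
illustration only. It shows that both the old and the new scheme report bytes;
the only visible difference is that the new value does not have to be a
multiple of the page size.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Old scheme: the counter holds whole pages and is scaled when printed. */
static uint64_t sk_stat_from_pages(uint64_t nr_pages)
{
	return nr_pages * SK_PAGE_SIZE;
}

/* New scheme: the counter already holds bytes and is printed as is. */
static uint64_t sk_stat_from_bytes(uint64_t nr_bytes)
{
	return nr_bytes;
}

/* Returns 1 if a reported value is a whole number of pages. */
static int sk_is_page_multiple(uint64_t bytes)
{
	return bytes % SK_PAGE_SIZE == 0;
}

/* Both schemes report bytes; only the granularity differs. */
static void sk_check_units(void)
{
	/* Three slab pages, reported the old way. */
	assert(sk_is_page_multiple(sk_stat_from_pages(3)));
	/* Three live 192-byte objects, reported the new way. */
	assert(!sk_is_page_multiple(sk_stat_from_bytes(3 * 192)));
}
```

A consumer that only parses the number keeps working; one that assumed page
multiples would not, which is the expectation being discussed here.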



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-20  9:51                           ` Vlastimil Babka
@ 2020-05-20 20:57                             ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-20 20:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christopher Lameter, Andrew Morton, Johannes Weiner,
	Michal Hocko, linux-mm, kernel-team, linux-kernel

On Wed, May 20, 2020 at 11:51:51AM +0200, Vlastimil Babka wrote:
> On 5/13/20 2:57 AM, Roman Gushchin wrote:
> > 
> > Btw, I'm trying to build up a prototype with an embedded memcg pointer,
> > but it seems to be way more tricky than I thought. It requires changes to
> > shrinkers (as they rely on getting the memcg pointer by an arbitrary
> > kernel address, not necessarily aligned to the head of slab allocation),
> > figuring out cache merging, adding SLAB support, natural alignment of
> > kmallocs etc.
> 
> Is the natural alignment of kmallocs a problem right now? As kmalloc()
> allocations are AFAIK not kmemcg-accounted? Or does your implementation add
> memcg awareness to everything, even if non-__GFP_ACCOUNT allocations just get a
> root memcg pointer?

There are at least a dozen accounted kmallocs as of now; please search for kmalloc
with GFP_KERNEL_ACCOUNT.

Natural alignment is not an issue with the proposed implementation, but it becomes
a problem as soon as we try to embed the memcg pointer into the object
(as Christopher is suggesting). I'm actually not opposed to his suggestion, I just
want to settle the memcg part first, and then discuss the best way
to store the memcg metadata. As I've shown, the changes required to switch
between different ways of storing the data are minimal and do not affect
the rest of the patchset.

Thanks!
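
A rough user-space sketch of why embedding a pointer interferes with natural
alignment. The sk_* helpers, the 64-byte object size and the 8-byte pointer
are illustrative assumptions, not taken from the patchset: once the per-object
stride grows from 64 to 72 bytes, objects no longer start on 64-byte
boundaries unless the stride is rounded up again.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every object start in the slab is a multiple of align. */
static int sk_all_objects_aligned(size_t stride, size_t align,
				  size_t slab_bytes)
{
	size_t off;

	for (off = 0; off + stride <= slab_bytes; off += stride)
		if (off % align)
			return 0;
	return 1;
}

/* Smallest stride >= needed that is a multiple of align. */
static size_t sk_round_up_stride(size_t needed, size_t align)
{
	return (needed + align - 1) / align * align;
}

/* 64-byte objects with and without an embedded 8-byte pointer. */
static void sk_check_alignment(void)
{
	/* Plain 64-byte objects are naturally aligned. */
	assert(sk_all_objects_aligned(64, 64, 4096));
	/* Appending a pointer gives a 72-byte stride and breaks it. */
	assert(!sk_all_objects_aligned(64 + 8, 64, 4096));
	/* Restoring alignment doubles the stride to 128 bytes. */
	assert(sk_round_up_stride(64 + 8, 64) == 128);
}
```

With a separate per-page vector the object stride stays at 64 bytes, so the
question does not come up.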



* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-20 13:51   ` Vlastimil Babka
@ 2020-05-20 21:00     ` Roman Gushchin
  2020-05-21 11:01       ` Vlastimil Babka
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-20 21:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel, Christoph Lameter

On Wed, May 20, 2020 at 03:51:45PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > This commit implements SLUB version of the obj_to_index() function,
> > which will be required to calculate the offset of obj_cgroup in the
> > obj_cgroups vector to store/obtain the objcg ownership data.
> > 
> > To make it faster, let's repeat the SLAB's trick introduced by
> > commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> > divide in obj_to_index()") and avoid an expensive division.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Christoph Lameter <cl@linux.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> There's already a slab_index() doing the same without the trick, with only
> SLUB_DEBUG callers. Maybe just improve it and perhaps rename? (obj_to_index()
> seems more descriptive). The difference is that it takes the result of
> page_addr() instead of doing that, as it's being called in a loop on objects
> from a single page, so you'd have to perhaps split to obj_to_index(page) and
> __obj_to_index(addr) or something.

Good point! How about this one?

--

From beeaecdac85c3a395dcfb99944dc8c858b541cbf Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Mon, 29 Jul 2019 18:18:42 -0700
Subject: [PATCH v3.2 04/19] mm: slub: implement SLUB version of obj_to_index()

This commit implements SLUB version of the obj_to_index() function,
which will be required to calculate the offset of obj_cgroup in the
obj_cgroups vector to store/obtain the objcg ownership data.

To make it faster, let's repeat the SLAB's trick introduced by
commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
divide in obj_to_index()") and avoid an expensive division.

Vlastimil Babka noticed that SLUB already has a similar
function called slab_index(), which is defined only if SLUB_DEBUG
is enabled. The function does similar math, but with a division,
and it also takes a page address instead of a page pointer.

Let's remove slab_index() and replace it with the new helper
__obj_to_index(), which takes a page address. obj_to_index()
will be a simple wrapper taking a page pointer and passing
page_address(page) into __obj_to_index().

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slub_def.h | 16 ++++++++++++++++
 mm/slub.c                | 15 +++++----------
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..30e91c83d401 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,18 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+/* Determine object index from a given position */
+static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
+					  void *addr, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - addr,
+				 cache->reciprocal_size);
+}
+
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return __obj_to_index(cache, page_address(page), obj);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index 2df4d4a420d1..d605d18b3c1b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -313,12 +313,6 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 		__p < (__addr) + (__objects) * (__s)->size; \
 		__p += (__s)->size)
 
-/* Determine object index from a given position */
-static inline unsigned int slab_index(void *p, struct kmem_cache *s, void *addr)
-{
-	return (kasan_reset_tag(p) - addr) / s->size;
-}
-
 static inline unsigned int order_objects(unsigned int order, unsigned int size)
 {
 	return ((unsigned int)PAGE_SIZE << order) / size;
@@ -461,7 +455,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
 	bitmap_zero(object_map, page->objects);
 
 	for (p = page->freelist; p; p = get_freepointer(s, p))
-		set_bit(slab_index(p, s, addr), object_map);
+		set_bit(__obj_to_index(s, addr, p), object_map);
 
 	return object_map;
 }
@@ -3682,6 +3676,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
@@ -3788,7 +3783,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
 
-		if (!test_bit(slab_index(p, s, addr), map)) {
+		if (!test_bit(__obj_to_index(s, addr, p), map)) {
 			pr_err("INFO: Object 0x%p @offset=%tu\n", p, p - addr);
 			print_tracking(s, p);
 		}
@@ -4513,7 +4508,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
 	/* Now we know that a valid freelist exists */
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
-		u8 val = test_bit(slab_index(p, s, addr), map) ?
+		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
 
 		if (!check_object(s, page, p, val))
@@ -4704,7 +4699,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
 
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
+		if (!test_bit(__obj_to_index(s, addr, p), map))
 			add_location(t, s, get_track(s, p, alloc));
 	put_map(map);
 }
-- 
2.25.4




* Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-05-20 19:26     ` Roman Gushchin
@ 2020-05-21  9:57       ` Vlastimil Babka
  2020-05-21 21:14         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-21  9:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On 5/20/20 9:26 PM, Roman Gushchin wrote:
> On Wed, May 20, 2020 at 02:25:22PM +0200, Vlastimil Babka wrote:
>> 
>> However __mod_node_page_state() and mod_node_state() will now always branch. I
>> wonder if the "API clean" goal is worth it...
> 
> You mean just adding a special write-side helper which will perform a conversion
> and put VM_WARN_ON_ONCE() into generic write-side helpers?

What I mean is that maybe node/global helpers should assume page granularity,
and lruvec/memcg helpers should check whether they need to convert from bytes
to pages when calling node/global helpers. Then there would be no extra branches
in node/global helpers. But maybe it's not worth saving those branches, dunno.

>> 
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -1409,9 +1409,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>> >  		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
>> >  		       1024);
>> >  	seq_buf_printf(&s, "slab %llu\n",
>> > -		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
>> > -			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
>> > -		       PAGE_SIZE);
>> > +		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
>> > +			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
>> >  	seq_buf_printf(&s, "sock %llu\n",
>> >  		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
>> >  		       PAGE_SIZE);
>> > @@ -1445,11 +1444,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>> >  			       PAGE_SIZE);
>> >  
>> >  	seq_buf_printf(&s, "slab_reclaimable %llu\n",
>> > -		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
>> > -		       PAGE_SIZE);
>> > +		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
>> >  	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
>> > -		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
>> > -		       PAGE_SIZE);
>> > +		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
>> 
>> So here we are now printing in bytes instead of pages, right? That's fine for
>> OOM report, but in sysfs aren't we breaking existing users?
>> 
> 
> Hm, but it was in bytes previously, look at that x * PAGE_SIZE.

Yeah, that's what I managed to overlook, sorry.

> Or do you mean that now it can be not rounded to PAGE_SIZE?
> If so, I don't think it breaks any expectations.
> 
> Thanks!
> 




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-20 21:00     ` Roman Gushchin
@ 2020-05-21 11:01       ` Vlastimil Babka
  2020-05-21 21:06         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-21 11:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel, Christoph Lameter

On 5/20/20 11:00 PM, Roman Gushchin wrote:
> 
> From beeaecdac85c3a395dcfb99944dc8c858b541cbf Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 29 Jul 2019 18:18:42 -0700
> Subject: [PATCH v3.2 04/19] mm: slub: implement SLUB version of obj_to_index()
> 
> This commit implements SLUB version of the obj_to_index() function,
> which will be required to calculate the offset of obj_cgroup in the
> obj_cgroups vector to store/obtain the objcg ownership data.
> 
> To make it faster, let's repeat the SLAB's trick introduced by
> commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> divide in obj_to_index()") and avoid an expensive division.
> 
> Vlastimil Babka noticed, that SLUB does have already a similar
> function called slab_index(), which is defined only if SLUB_DEBUG
> is enabled. The function does a similar math, but with a division,
> and it also takes a page address instead of a page pointer.
> 
> Let's remove slab_index() and replace it with the new helper
> __obj_to_index(), which takes a page address. obj_to_index()
> will be a simple wrapper taking a page pointer and passing
> page_address(page) into __obj_to_index().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Looks good!

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/slub_def.h | 16 ++++++++++++++++
>  mm/slub.c                | 15 +++++----------
>  2 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index d2153789bd9f..30e91c83d401 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -8,6 +8,7 @@
>   * (C) 2007 SGI, Christoph Lameter
>   */
>  #include <linux/kobject.h>
> +#include <linux/reciprocal_div.h>
>  
>  enum stat_item {
>  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> @@ -86,6 +87,7 @@ struct kmem_cache {
>  	unsigned long min_partial;
>  	unsigned int size;	/* The size of an object including metadata */
>  	unsigned int object_size;/* The size of an object without metadata */
> +	struct reciprocal_value reciprocal_size;
>  	unsigned int offset;	/* Free pointer offset */
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  	/* Number of per cpu partial objects to keep around */
> @@ -182,4 +184,18 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
>  	return result;
>  }
>  
> +/* Determine object index from a given position */
> +static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
> +					  void *addr, void *obj)
> +{
> +	return reciprocal_divide(kasan_reset_tag(obj) - addr,
> +				 cache->reciprocal_size);
> +}
> +
> +static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> +					const struct page *page, void *obj)
> +{
> +	return __obj_to_index(cache, page_address(page), obj);
> +}
> +
>  #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/mm/slub.c b/mm/slub.c
> index 2df4d4a420d1..d605d18b3c1b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -313,12 +313,6 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>  		__p < (__addr) + (__objects) * (__s)->size; \
>  		__p += (__s)->size)
>  
> -/* Determine object index from a given position */
> -static inline unsigned int slab_index(void *p, struct kmem_cache *s, void *addr)
> -{
> -	return (kasan_reset_tag(p) - addr) / s->size;
> -}
> -
>  static inline unsigned int order_objects(unsigned int order, unsigned int size)
>  {
>  	return ((unsigned int)PAGE_SIZE << order) / size;
> @@ -461,7 +455,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
>  	bitmap_zero(object_map, page->objects);
>  
>  	for (p = page->freelist; p; p = get_freepointer(s, p))
> -		set_bit(slab_index(p, s, addr), object_map);
> +		set_bit(__obj_to_index(s, addr, p), object_map);
>  
>  	return object_map;
>  }
> @@ -3682,6 +3676,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
>  	 */
>  	size = ALIGN(size, s->align);
>  	s->size = size;
> +	s->reciprocal_size = reciprocal_value(size);
>  	if (forced_order >= 0)
>  		order = forced_order;
>  	else
> @@ -3788,7 +3783,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
>  	map = get_map(s, page);
>  	for_each_object(p, s, addr, page->objects) {
>  
> -		if (!test_bit(slab_index(p, s, addr), map)) {
> +		if (!test_bit(__obj_to_index(s, addr, p), map)) {
>  			pr_err("INFO: Object 0x%p @offset=%tu\n", p, p - addr);
>  			print_tracking(s, p);
>  		}
> @@ -4513,7 +4508,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
>  	/* Now we know that a valid freelist exists */
>  	map = get_map(s, page);
>  	for_each_object(p, s, addr, page->objects) {
> -		u8 val = test_bit(slab_index(p, s, addr), map) ?
> +		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
>  			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
>  
>  		if (!check_object(s, page, p, val))
> @@ -4704,7 +4699,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
>  
>  	map = get_map(s, page);
>  	for_each_object(p, s, addr, page->objects)
> -		if (!test_bit(slab_index(p, s, addr), map))
> +		if (!test_bit(__obj_to_index(s, addr, p), map))
>  			add_location(t, s, get_track(s, p, alloc));
>  	put_map(map);
>  }
> 




* Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-05-21 11:01       ` Vlastimil Babka
@ 2020-05-21 21:06         ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-21 21:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel, Christoph Lameter

On Thu, May 21, 2020 at 01:01:38PM +0200, Vlastimil Babka wrote:
> On 5/20/20 11:00 PM, Roman Gushchin wrote:
> > 
> > From beeaecdac85c3a395dcfb99944dc8c858b541cbf Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Mon, 29 Jul 2019 18:18:42 -0700
> > Subject: [PATCH v3.2 04/19] mm: slub: implement SLUB version of obj_to_index()
> > 
> > This commit implements SLUB version of the obj_to_index() function,
> > which will be required to calculate the offset of obj_cgroup in the
> > obj_cgroups vector to store/obtain the objcg ownership data.
> > 
> > To make it faster, let's repeat the SLAB's trick introduced by
> > commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> > divide in obj_to_index()") and avoid an expensive division.
> > 
> > Vlastimil Babka noticed, that SLUB does have already a similar
> > function called slab_index(), which is defined only if SLUB_DEBUG
> > is enabled. The function does a similar math, but with a division,
> > and it also takes a page address instead of a page pointer.
> > 
> > Let's remove slab_index() and replace it with the new helper
> > __obj_to_index(), which takes a page address. obj_to_index()
> > will be a simple wrapper taking a page pointer and passing
> > page_address(page) into __obj_to_index().
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Looks good!
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!
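
For readers following along, a user-space sketch of the multiply-instead-of-
divide trick that obj_to_index() relies on. It is modelled on my reading of
the kernel's reciprocal_value()/reciprocal_divide() helpers; the sk_* names
are hypothetical and the code is an approximation for illustration, not the
kernel implementation. The divisor (object size) is fixed per cache, so the
expensive part is done once and each lookup is a multiply plus shifts.

```c
#include <assert.h>
#include <stdint.h>

struct sk_reciprocal {
	uint32_t m;		/* precomputed multiplier */
	uint8_t sh1, sh2;	/* precomputed shifts */
};

/* Position of the highest set bit, 1-based; 0 for x == 0. */
static int sk_fls(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Precompute the reciprocal of d once per cache; d must be non-zero. */
static struct sk_reciprocal sk_reciprocal_value(uint32_t d)
{
	struct sk_reciprocal r;
	int l = sk_fls(d - 1);
	uint64_t m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;

	r.m = (uint32_t)m;
	r.sh1 = l < 1 ? l : 1;
	r.sh2 = l > 1 ? l - 1 : 0;
	return r;
}

/* a / d using one multiplication and two shifts. */
static uint32_t sk_reciprocal_divide(uint32_t a, struct sk_reciprocal r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Compare against a plain division for every byte offset in a slab. */
static void sk_check_size(uint32_t size, uint32_t slab_bytes)
{
	struct sk_reciprocal r = sk_reciprocal_value(size);
	uint32_t off;

	for (off = 0; off < slab_bytes; off++)
		assert(sk_reciprocal_divide(off, r) == off / size);
}
```

Calling sk_check_size(192, 4096 << 3) walks an order-3 slab of 192-byte
objects and confirms that each offset maps to the same index as off / size.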



* Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-05-21  9:57       ` Vlastimil Babka
@ 2020-05-21 21:14         ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-21 21:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Thu, May 21, 2020 at 11:57:12AM +0200, Vlastimil Babka wrote:
> On 5/20/20 9:26 PM, Roman Gushchin wrote:
> > On Wed, May 20, 2020 at 02:25:22PM +0200, Vlastimil Babka wrote:
> >> 
> >> However __mod_node_page_state() and mod_node_state() will now always branch. I
> >> wonder if the "API clean" goal is worth it...
> > 
> > You mean just adding a special write-side helper which will perform a conversion
> > and put VM_WARN_ON_ONCE() into generic write-side helpers?
> 
> What I mean is that maybe node/global helpers should assume page granularity,
> and lruvec/memcg helpers should check whether they need to convert from bytes
> to pages when calling node/global helpers. Then there would be no extra branches
> in node/global helpers. But maybe it's not worth saving those branches, dunno.

The problem is with helpers like mod_lruvec_state(), which modify both global
and memcg-level counters. Also, memcg-level and global counters share indexes, so
it would be confusing to have NR_SLAB_RECLAIMABLE in bytes on one level and
in pages on the other.

So, I don't know, maybe there is a less complicated way of organizing these
counters, but I have no ideas at the moment. If you do, I'd appreciate it.

Thanks!
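
To illustrate the constraint, a simplified user-space sketch of one shared
index used at two levels. Everything named sk_* is hypothetical, and the
layout is an assumption for illustration rather than the patch itself: the
node level stores pages internally and converts on the write and read side,
while the memcg level stores bytes directly, so callers use byte units with
the same item at both levels.

```c
#include <assert.h>

#define SK_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define SK_PAGE_SIZE  (1L << SK_PAGE_SHIFT)

enum sk_item { SK_FILE_PAGES, SK_SLAB_RECLAIMABLE_B, SK_NR_ITEMS };

static long sk_node_stat[SK_NR_ITEMS];	/* always stored in pages */
static long sk_memcg_stat[SK_NR_ITEMS];	/* bytes for *_B items */

/* The extra branch discussed above: is this item expressed in bytes? */
static int sk_item_in_bytes(enum sk_item item)
{
	return item == SK_SLAB_RECLAIMABLE_B;
}

/* Node-level write: byte deltas must be whole pages, stored as pages. */
static void sk_mod_node_state(enum sk_item item, long delta)
{
	if (sk_item_in_bytes(item)) {
		assert(delta % SK_PAGE_SIZE == 0);
		delta /= SK_PAGE_SIZE;
	}
	sk_node_stat[item] += delta;
}

/* Memcg-level write: byte-precise, no conversion. */
static void sk_mod_memcg_state(enum sk_item item, long delta)
{
	sk_memcg_stat[item] += delta;
}

/* Node-level read: convert back so callers see the item's own unit. */
static long sk_node_state(enum sk_item item)
{
	long val = sk_node_stat[item];

	return sk_item_in_bytes(item) ? val * SK_PAGE_SIZE : val;
}
```

Moving the conversion out of the node-level helpers removes the branch there,
but then the same index would mean pages at one level and bytes at the other.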



* Re: [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
  2020-04-23 20:20   ` Roman Gushchin
@ 2020-05-22 18:27   ` Vlastimil Babka
  2020-05-23  1:32     ` Roman Gushchin
  2020-05-26 17:50     ` Roman Gushchin
  2020-05-25 14:46   ` Vlastimil Babka
  2 siblings, 2 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-22 18:27 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
> 
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

But I have a suggestion:

...

> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -191,4 +191,6 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
>  				 cache->reciprocal_size);
>  }
>  
> +extern int objs_per_slab(struct kmem_cache *cache);
> +
>  #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7f87a0eeafec..63826e460b3f 100644

...

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5992,4 +5992,9 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
>  {
>  	return -EIO;
>  }
> +
> +int objs_per_slab(struct kmem_cache *cache)
> +{
> +	return oo_objects(cache->oo);
> +}
>  #endif /* CONFIG_SLUB_DEBUG */
> 

It's somewhat unfortunate to make a function call just for this. Although perhaps
the compiler can be smart enough, as charge_slab_page() (which calls objs_per_slab())
is inline and called from alloc_slab_page(), which is also in mm/slub.c.

But it might also be a bit wasteful in case SLUB doesn't manage to allocate its
desired order and falls back to a smaller one. The actual number of objects is
then in page->objects.

So ideally this should use something like objs_per_slab_page(cache, page) where
SLAB supplies cache->num and SLUB page->objects, both implementations inline,
and ignoring the other parameter?
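
A toy user-space illustration of the waste in question. The sk_* structs, the
field names and the numbers are made up for the example; only the idea of
using the per-page object count comes from the suggestion above.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

struct sk_cache {
	unsigned int size;	/* object size including metadata */
	unsigned int order;	/* desired slab order */
};

struct sk_page {
	unsigned int objects;	/* objects that fit in this slab */
};

static unsigned int sk_order_objects(unsigned int order, unsigned int size)
{
	return (SK_PAGE_SIZE << order) / size;
}

/* Per-cache variant: always assumes the desired order was allocated. */
static unsigned int sk_objs_per_slab(const struct sk_cache *cache)
{
	return sk_order_objects(cache->order, cache->size);
}

/* Per-page variant: uses what was really allocated for this slab. */
static unsigned int sk_objs_per_slab_page(const struct sk_cache *cache,
					  const struct sk_page *page)
{
	(void)cache;		/* unused in this SLUB-like variant */
	return page->objects;
}

/* Pointer slots wasted when the vector is sized from the cache. */
static size_t sk_wasted_bytes(const struct sk_cache *cache,
			      const struct sk_page *page)
{
	unsigned int want = sk_objs_per_slab(cache);
	unsigned int have = sk_objs_per_slab_page(cache, page);

	assert(want >= have);
	return (size_t)(want - have) * sizeof(void *);
}
```

For 256-byte objects with a desired order of 3 and a fallback to order 0, the
per-cache count is 128 while the page only holds 16 objects, so 112 pointer
slots in the vector would go unused.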



* Re: [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-05-22 18:27   ` Vlastimil Babka
@ 2020-05-23  1:32     ` Roman Gushchin
  2020-05-26 17:50     ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-23  1:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, May 22, 2020 at 08:27:15PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > Allocate and release memory to store obj_cgroup pointers for each
> > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > to the allocated space.
> > 
> > To distinguish between obj_cgroups and memcg pointers in case
> > when it's not obvious which one is used (as in page_cgroup_ino()),
> > let's always set the lowest bit in the obj_cgroup case.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thank you!

> 
> But I have a suggestion:
> 
> ...
> 
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -191,4 +191,6 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> >  				 cache->reciprocal_size);
> >  }
> >  
> > +extern int objs_per_slab(struct kmem_cache *cache);
> > +
> >  #endif /* _LINUX_SLUB_DEF_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7f87a0eeafec..63826e460b3f 100644
> 
> ...
> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5992,4 +5992,9 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
> >  {
> >  	return -EIO;
> >  }
> > +
> > +int objs_per_slab(struct kmem_cache *cache)
> > +{
> > +	return oo_objects(cache->oo);
> > +}
> >  #endif /* CONFIG_SLUB_DEBUG */
> > 
> 
> It's somewhat unfortunate to make a function call just for this. Although perhaps
> the compiler can be smart enough, as charge_slab_page() (which calls objs_per_slab())
> is inline and called from alloc_slab_page(), which is also in mm/slub.c.
> 
> But it might also be a bit wasteful in case SLUB doesn't manage to allocate its
> desired order and falls back to a smaller one. The actual number of objects is
> then in page->objects.
> 
> So ideally this should use something like objs_per_slab_page(cache, page) where
> SLAB supplies cache->num and SLUB page->objects, both implementations inline,
> and ignoring the other parameter?

Yeah, good point, makes total sense to me. I'll implement it in the next version
of the patchset.

Thank you!
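
As a side note on the commit message quoted above, this is roughly what the
"set the lowest bit" scheme looks like in user space. The sk_* names are
hypothetical and the sketch assumes that both kinds of pointers are at least
2-byte aligned, so that bit 0 is free to act as a tag.

```c
#include <assert.h>
#include <stdint.h>

struct sk_memcg { int id; };
struct sk_objcg { int id; };

/* Store a pointer to the obj_cgroup vector with the lowest bit set. */
static uintptr_t sk_tag_objcgs(struct sk_objcg **vec)
{
	return (uintptr_t)vec | 0x1UL;
}

/* A set lowest bit means the word holds an obj_cgroup vector. */
static int sk_is_objcgs(uintptr_t word)
{
	return (word & 0x1UL) != 0;
}

/* Clear the tag bit to recover the original vector pointer. */
static struct sk_objcg **sk_untag_objcgs(uintptr_t word)
{
	return (struct sk_objcg **)(word & ~(uintptr_t)0x1);
}

/* One word can hold either kind of pointer and still be told apart. */
static void sk_check_tagging(void)
{
	static struct sk_objcg *vec[4];
	static struct sk_memcg memcg = { 1 };
	uintptr_t as_objcgs = sk_tag_objcgs(vec);
	uintptr_t as_memcg = (uintptr_t)&memcg;	/* stored untagged */

	assert(sk_is_objcgs(as_objcgs));
	assert(sk_untag_objcgs(as_objcgs) == vec);
	assert(!sk_is_objcgs(as_memcg));
}
```

A reader such as page_cgroup_ino() would test the bit first and only then
decide how to interpret the word.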



* Re: [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
  2020-04-23 20:20   ` Roman Gushchin
  2020-05-22 18:27   ` Vlastimil Babka
@ 2020-05-25 14:46   ` Vlastimil Babka
  2 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-25 14:46 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1370,7 +1370,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>  		return NULL;
>  	}
>  
> -	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
> +	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
> +			     cachep->num)) {
>  		__free_pages(page, cachep->gfporder);
>  		return NULL;
>  	}

Hmm, I noticed this only when looking at a later patch: this hunk adds a parameter
that the function doesn't take, so it doesn't compile.



* Re: [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-04-22 20:46 ` [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-05-25 15:07   ` Vlastimil Babka
  2020-05-26 17:53     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-25 15:07 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> Store the obj_cgroup pointer in the corresponding place of
> page->obj_cgroups for each allocated non-root slab object.
> Make sure that each allocated object holds a reference to obj_cgroup.
> 
> Objcg pointer is obtained from the memcg->objcg dereferencing
> in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> post_alloc_hook. Then in case of successful allocation(s) it's
> getting stored in the page->obj_cgroups vector.
> 
> The objcg obtaining part looks a bit bulky now, but it will be simplified
> by the next commits in the series.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> diff --git a/mm/slab.h b/mm/slab.h
> index 44def57f050e..525e09e05743 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
...
> @@ -636,8 +684,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
>  					 s->flags, flags);
>  	}
>  
> -	if (memcg_kmem_enabled())
> -		memcg_kmem_put_cache(s);
> +	if (!is_root_cache(s))
> +		memcg_slab_post_alloc_hook(s, objcg, size, p);
>  }
>  
>  #ifndef CONFIG_SLOB

Keep also the memcg_kmem_enabled() static key check, like elsewhere?




* Re: [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-04-22 20:46 ` [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2020-05-25 16:10   ` Vlastimil Babka
  2020-05-26 18:04     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-25 16:10 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:46 PM, Roman Gushchin wrote:
> Switch to per-object accounting of non-root slab objects.
> 
> Charging is performed using obj_cgroup API in the pre_alloc hook.
> Obj_cgroup is charged with the size of the object and the size
> of metadata: as now it's the size of an obj_cgroup pointer.
> If the amount of memory has been charged successfully, the actual
> allocation code is executed. Otherwise, -ENOMEM is returned.
> 
> In the post_alloc hook if the actual allocation succeeded,
> corresponding vmstats are bumped and the obj_cgroup pointer is saved.
> Otherwise, the charge is canceled.
> 
> On the free path obj_cgroup pointer is obtained and used to uncharge
> the size of the releasing object.
> 
> Memcg and lruvec counters are now representing only memory used
> by active slab objects and do not include the free space. The free
> space is shared and doesn't belong to any specific cgroup.
> 
> Global per-node slab vmstats are still modified from (un)charge_slab_page()
> functions. The idea is to keep all slab pages accounted as slab pages
> on system level.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Suggestion below:

> @@ -568,32 +548,33 @@ static __always_inline int charge_slab_page(struct page *page,
>  					    gfp_t gfp, int order,
>  					    struct kmem_cache *s)
>  {
> -	int ret;
> -
> -	if (is_root_cache(s)) {
> -		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> -				    PAGE_SIZE << order);
> -		return 0;
> -	}
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!is_root_cache(s)) {

This could also benefit from memcg_kmem_enabled() static key test AFAICS. Maybe
even have a wrapper for both tests together?

> +		int ret;
>  
> -	ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
> -	if (ret)
> -		return ret;
> +		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));

You created memcg_alloc_page_obj_cgroups() empty variant for !CONFIG_MEMCG_KMEM
but now the only caller is under CONFIG_MEMCG_KMEM.

> +		if (ret)
> +			return ret;
>  
> -	return memcg_charge_slab(page, gfp, order, s);
> +		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);

Perhaps moving this refcount manipulation into memcg_alloc_page_obj_cgroups()
(maybe under a different name then) would let you avoid adding
#ifdef CONFIG_MEMCG_KMEM in this function.

Maybe this is all moot after patch 12/19, will find out :)

> +	}
> +#endif
> +	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +			    PAGE_SIZE << order);
> +	return 0;
>  }
>  
>  static __always_inline void uncharge_slab_page(struct page *page, int order,
>  					       struct kmem_cache *s)
>  {
> -	if (is_root_cache(s)) {
> -		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> -				    -(PAGE_SIZE << order));
> -		return;
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!is_root_cache(s)) {

Everything from above also applies here.

> +		memcg_free_page_obj_cgroups(page);
> +		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
>  	}
> -
> -	memcg_free_page_obj_cgroups(page);
> -	memcg_uncharge_slab(page, order, s);
> +#endif
> +	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +			    -(PAGE_SIZE << order));
>  }
>  
>  static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)





* Re: [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  2020-04-22 20:47 ` [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2020-05-25 17:03   ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-25 17:03 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> To make the memcg_kmem_bypass() function available outside of
> the memcontrol.c, let's move it to memcontrol.h. The function
> is small and nicely fits into static inline sort of functions.
> 
> It will be used from the slab code.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/memcontrol.h | 7 +++++++
>  mm/memcontrol.c            | 7 -------
>  2 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 44b7d1244620..840eb8d486a8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1462,6 +1462,13 @@ static inline bool memcg_kmem_enabled(void)
>  	return static_branch_unlikely(&memcg_kmem_enabled_key);
>  }
>  
> +static inline bool memcg_kmem_bypass(void)
> +{
> +	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +		return true;
> +	return false;
> +}
> +
>  static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
>  					 int order)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f957b029a62f..06a5929f4872 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2941,13 +2941,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
>  	queue_work(memcg_kmem_cache_wq, &cw->work);
>  }
>  
> -static inline bool memcg_kmem_bypass(void)
> -{
> -	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> -		return true;
> -	return false;
> -}
> -
>  /**
>   * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
>   * @cachep: the original global kmem cache
> 




* Re: [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
  2020-04-22 20:47 ` [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
@ 2020-05-26 10:12   ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 10:12 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> This is fairly big but mostly red patch, which makes all accounted
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
> 
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.
> 
> This patch performs the following changes:
> 1) introduces memcg_params.memcg_cache pointer to represent the
>    kmem_cache which will be used for all non-root allocations
> 2) reuses the existing memcg kmem_cache creation mechanism
>    to create memcg kmem_cache on the first allocation attempt
> 3) memcg kmem_caches are named <kmemcache_name>-memcg,
>    e.g. dentry-memcg
> 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache
>    or schedule its creation and return the root cache
> 5) removes almost all non-root kmem_cache management code
>    (separate refcounter, reparenting, shrinking, etc)
> 6) makes slab debugfs to display root_mem_cgroup css id and never
>    show :dead and :deact flags in the memcg_slabinfo attribute.
> 
> Following patches in the series will simplify the kmem_cache creation.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h |   5 +-
>  include/linux/slab.h       |   5 +-
>  mm/memcontrol.c            | 163 +++-----------
>  mm/slab.c                  |  16 +-
>  mm/slab.h                  | 145 ++++---------
>  mm/slab_common.c           | 426 ++++---------------------------------
>  mm/slub.c                  |  38 +---
>  7 files changed, 128 insertions(+), 670 deletions(-)

Nice stats.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -548,17 +502,14 @@ static __always_inline int charge_slab_page(struct page *page,
>  					    gfp_t gfp, int order,
>  					    struct kmem_cache *s)
>  {
> -#ifdef CONFIG_MEMCG_KMEM

Ah, indeed. Still, wouldn't there be less churn if the ref manipulation were
done in memcg_alloc/free_page_obj_cgroups()?

>  	if (!is_root_cache(s)) {
>  		int ret;
>  
>  		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
>  		if (ret)
>  			return ret;
> -
> -		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
>  	}
> -#endif
> +
>  	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
>  			    PAGE_SIZE << order);
>  	return 0;



* Re: [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-04-22 20:47 ` [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
@ 2020-05-26 10:31   ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 10:31 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> Because the number of non-root kmem_caches doesn't depend on the
> number of memory cgroups anymore and is generally not very big,
> there is no more need for a dedicated workqueue.
> 
> Also, as there is no more need to pass any arguments to the
> memcg_create_kmem_cache() except the root kmem_cache, it's
> possible to just embed the work structure into the kmem_cache
> and avoid the dynamic allocation of the work structure.
> 
> This will also simplify the synchronization: for each root kmem_cache
> there is only one work. So there will be no more concurrent attempts
> to create a non-root kmem_cache for a root kmem_cache: the second and
> all following attempts to queue the work will fail.
> 
> On the kmem_cache destruction path there is no more need to call the
> expensive flush_workqueue() and wait for all pending works to be
> finished. Instead, cancel_work_sync() can be used to cancel/wait for
> only one work.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
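
To illustrate the embedded-work idea from the changelog, here is a small
userspace sketch. Everything in it is a mock (work_mock, cache_mock,
queue_work_mock() and run_work_mock() are hypothetical names); it only shows
how container_of() recovers the owning cache from the embedded work item, and
why a second attempt to queue the same work is rejected while it is pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mock of a work item: a callback plus a "pending" flag. */
struct work_mock {
	void (*func)(struct work_mock *work);
	bool pending;
};

struct params_mock {
	struct work_mock work;	/* embedded, so no dynamic allocation */
};

struct cache_mock {
	const char *name;
	int created;		/* how many times the creation callback ran */
	struct params_mock memcg_params;
};

/* The callback gets only the work pointer and derives the cache from it. */
static void cache_create_func(struct work_mock *work)
{
	struct cache_mock *c = container_of(work, struct cache_mock,
					    memcg_params.work);
	c->created++;
}

static void init_cache_mock(struct cache_mock *c)
{
	c->created = 0;
	c->memcg_params.work.func = cache_create_func;
	c->memcg_params.work.pending = false;
}

/* Returns false if the work is already pending, true if it was queued. */
static bool queue_work_mock(struct work_mock *work)
{
	if (work->pending)
		return false;
	work->pending = true;
	return true;
}

/* Stand-in for a worker thread picking up the item. */
static void run_work_mock(struct work_mock *work)
{
	assert(work->pending);
	work->pending = false;
	work->func(work);
}
```

Because there is exactly one work item per root cache, the destroy path only
has to deal with that one item rather than flushing a whole workqueue.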

> ---
>  include/linux/memcontrol.h |  1 -
>  mm/memcontrol.c            | 48 +-------------------------------------
>  mm/slab.h                  |  2 ++
>  mm/slab_common.c           | 22 +++++++++--------
>  4 files changed, 15 insertions(+), 58 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 698b92d60da5..87e6da5015b3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1440,7 +1440,6 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
>  
>  extern struct static_key_false memcg_kmem_enabled_key;
> -extern struct workqueue_struct *memcg_kmem_cache_wq;
>  
>  extern int memcg_nr_cache_ids;
>  void memcg_get_cache_ids(void);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9fe2433fbe67..55fd42155a37 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -379,8 +379,6 @@ void memcg_put_cache_ids(void)
>   */
>  DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
>  EXPORT_SYMBOL(memcg_kmem_enabled_key);
> -
> -struct workqueue_struct *memcg_kmem_cache_wq;
>  #endif
>  
>  static int memcg_shrinker_map_size;
> @@ -2900,39 +2898,6 @@ static void memcg_free_cache_id(int id)
>  	ida_simple_remove(&memcg_cache_ida, id);
>  }
>  
> -struct memcg_kmem_cache_create_work {
> -	struct kmem_cache *cachep;
> -	struct work_struct work;
> -};
> -
> -static void memcg_kmem_cache_create_func(struct work_struct *w)
> -{
> -	struct memcg_kmem_cache_create_work *cw =
> -		container_of(w, struct memcg_kmem_cache_create_work, work);
> -	struct kmem_cache *cachep = cw->cachep;
> -
> -	memcg_create_kmem_cache(cachep);
> -
> -	kfree(cw);
> -}
> -
> -/*
> - * Enqueue the creation of a per-memcg kmem_cache.
> - */
> -static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
> -{
> -	struct memcg_kmem_cache_create_work *cw;
> -
> -	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
> -	if (!cw)
> -		return;
> -
> -	cw->cachep = cachep;
> -	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
> -
> -	queue_work(memcg_kmem_cache_wq, &cw->work);
> -}
> -
>  /**
>   * memcg_kmem_get_cache: select memcg or root cache for allocation
>   * @cachep: the original global kmem cache
> @@ -2949,7 +2914,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>  
>  	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
>  	if (unlikely(!memcg_cachep)) {
> -		memcg_schedule_kmem_cache_create(cachep);
> +		queue_work(system_wq, &cachep->memcg_params.work);
>  		return cachep;
>  	}
>  
> @@ -7122,17 +7087,6 @@ static int __init mem_cgroup_init(void)
>  {
>  	int cpu, node;
>  
> -#ifdef CONFIG_MEMCG_KMEM
> -	/*
> -	 * Kmem cache creation is mostly done with the slab_mutex held,
> -	 * so use a workqueue with limited concurrency to avoid stalling
> -	 * all worker threads in case lots of cgroups are created and
> -	 * destroyed simultaneously.
> -	 */
> -	memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
> -	BUG_ON(!memcg_kmem_cache_wq);
> -#endif
> -
>  	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
>  				  memcg_hotplug_cpu_dead);
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 28c582ec997a..a4e115cb8bdc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -45,12 +45,14 @@ struct kmem_cache {
>   * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
>   *		cgroups.
>   * @root_caches_node: list node for slab_root_caches list.
> + * @work: work struct used to create the non-root cache.
>   */
>  struct memcg_cache_params {
>  	struct kmem_cache *root_cache;
>  
>  	struct kmem_cache *memcg_cache;
>  	struct list_head __root_caches_node;
> +	struct work_struct work;
>  };
>  #endif /* CONFIG_SLOB */
>  
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index e9deaafddbb6..10aa2acb84ca 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -132,10 +132,18 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
>  
>  LIST_HEAD(slab_root_caches);
>  
> +static void memcg_kmem_cache_create_func(struct work_struct *work)
> +{
> +	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
> +						 memcg_params.work);
> +	memcg_create_kmem_cache(cachep);
> +}
> +
>  void slab_init_memcg_params(struct kmem_cache *s)
>  {
>  	s->memcg_params.root_cache = NULL;
>  	s->memcg_params.memcg_cache = NULL;
> +	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
>  }
>  
>  static void init_memcg_params(struct kmem_cache *s,
> @@ -584,15 +592,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  	return 0;
>  }
>  
> -static void flush_memcg_workqueue(struct kmem_cache *s)
> +static void cancel_memcg_cache_creation(struct kmem_cache *s)
>  {
> -	/*
> -	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
> -	 * deactivates the memcg kmem_caches through workqueue. Make sure all
> -	 * previous workitems on workqueue are processed.
> -	 */
> -	if (likely(memcg_kmem_cache_wq))
> -		flush_workqueue(memcg_kmem_cache_wq);
> +	cancel_work_sync(&s->memcg_params.work);
>  }
>  #else
>  static inline int shutdown_memcg_caches(struct kmem_cache *s)
> @@ -600,7 +602,7 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
>  	return 0;
>  }
>  
> -static inline void flush_memcg_workqueue(struct kmem_cache *s)
> +static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
>  {
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
> @@ -619,7 +621,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  	if (unlikely(!s))
>  		return;
>  
> -	flush_memcg_workqueue(s);
> +	cancel_memcg_cache_creation(s);
>  
>  	get_online_cpus();
>  	get_online_mems();
> 




* Re: [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache()
  2020-04-22 20:47 ` [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
@ 2020-05-26 10:34   ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 10:34 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

"deprecate" means it still exist but shouldn't get new callers, no?
maybe just "remove" or "inline ... into its caller"

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> The memcg_kmem_get_cache() function became really trivial, so
> let's just inline it into the single call point:
> memcg_slab_pre_alloc_hook().
> 
> It will make the code less bulky and can also help the compiler
> to generate a better code.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>




* Re: [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches
  2020-04-22 20:47 ` [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
@ 2020-05-26 10:52   ` Vlastimil Babka
  2020-05-26 18:50     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 10:52 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> Currently there are two lists of kmem_caches:
> 1) slab_caches, which contains all kmem_caches,
> 2) slab_root_caches, which contains only root kmem_caches.
> 
> And there is some preprocessor magic to have a single list
> if CONFIG_MEMCG_KMEM isn't enabled.
> 
> It was required earlier because the number of non-root kmem_caches
> was proportional to the number of memory cgroups and could reach
> really big values. Now, when it cannot exceed the number of root
> kmem_caches, there is really no reason to maintain two lists.
> 
> We never iterate over the slab_root_caches list on any hot paths,
> so it's perfectly fine to iterate over slab_caches and filter out
> non-root kmem_caches.
> 
> It allows to remove a lot of config-dependent code and two pointers
> from the kmem_cache structure.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -1148,11 +1126,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
>  
>  static int slab_show(struct seq_file *m, void *p)
>  {
> -	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
> +	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
>  
> -	if (p == slab_root_caches.next)
> +	if (p == slab_caches.next)
>  		print_slabinfo_header(m);
> -	cache_show(s, m);
> +	if (is_root_cache(s))
> +		cache_show(s, m);

If it weren't for patch 17/19, couldn't we just remove this condition and let
/proc/slabinfo contain the -memcg variants?




* Re: [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  2020-04-22 20:47 ` [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2020-05-26 11:31   ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 11:31 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> memcg_accumulate_slabinfo() is never called with a non-root
> kmem_cache as a first argument, so the is_root_cache(s) check
> is redundant and can be removed without any functional change.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

The check is also in memcg_cache() anyway.

> ---
>  mm/slab_common.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index c045afb9724e..52164ad0f197 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1087,9 +1087,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
>  	struct kmem_cache *c;
>  	struct slabinfo sinfo;
>  
> -	if (!is_root_cache(s))
> -		return;
> -
>  	c = memcg_cache(s);
>  	if (c) {
>  		memset(&sinfo, 0, sizeof(sinfo));
> 




* Re: [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-04-22 20:47 ` [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
@ 2020-05-26 14:55   ` Vlastimil Babka
  2020-05-27  8:35     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 14:55 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team,
	linux-kernel, Jesper Dangaard Brouer, Mel Gorman

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> Instead of having two sets of kmem_caches: one for system-wide and
> non-accounted allocations and the second one shared by all accounted
> allocations, we can use just one.
> 
> The idea is simple: space for obj_cgroup metadata can be allocated
> on demand and filled only for accounted allocations.
> 
> It allows to remove a bunch of code which is required to handle
> kmem_cache clones for accounted allocations. There is no more need
> to create them, accumulate statistics, propagate attributes, etc.
> It's a quite significant simplification.
> 
> Also, because the total number of slab_caches is reduced almost twice
> (not all kmem_caches have a memcg clone), some additional memory
> savings are expected. On my devvm it additionally saves about 3.5%
> of slab memory.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

However, this series will affect slab fastpaths, and this patch in particular
may affect even the freeing of non-kmemcg allocations. I'm CCing Jesper and
Mel for awareness since, AFAIK, they have worked on network stack memory
management performance; perhaps some benchmarks are in order...




* Re: [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-04-22 20:47 ` [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
@ 2020-05-26 15:24   ` Vlastimil Babka
  2020-05-26 15:45     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-26 15:24 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, linux-mm, kernel-team, linux-kernel

On 4/22/20 10:47 PM, Roman Gushchin wrote:
> Add some tests to cover the kernel memory accounting functionality.
> These are covering some issues (and changes) we had recently.
> 
> 1) A test which allocates a lot of negative dentries, checks memcg
> slab statistics, creates memory pressure by setting memory.max
> to some low value and checks that some number of slabs was reclaimed.
> 
> 2) A test which covers side effects of memcg destruction: it creates
> and destroys a large number of sub-cgroups, each containing a
> multi-threaded workload which allocates and releases some kernel
> memory. Then it checks that the charge and memory.stats do add up
> on the parent level.
> 
> 3) A test which reads /proc/kpagecgroup and implicitly checks that it
> doesn't crash the system.
> 
> 4) A test which spawns a large number of threads and checks that
> the kernel stacks accounting works as expected.
> 
> 5) A test which checks that living charged slab objects are not
> preventing the memory cgroup from being released after being deleted
> by a user.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  tools/testing/selftests/cgroup/.gitignore  |   1 +
>  tools/testing/selftests/cgroup/Makefile    |   2 +
>  tools/testing/selftests/cgroup/test_kmem.c | 382 +++++++++++++++++++++
>  3 files changed, 385 insertions(+)
>  create mode 100644 tools/testing/selftests/cgroup/test_kmem.c
> 
> diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
> index aa6de65b0838..84cfcabea838 100644
> --- a/tools/testing/selftests/cgroup/.gitignore
> +++ b/tools/testing/selftests/cgroup/.gitignore
> @@ -2,3 +2,4 @@
>  test_memcontrol
>  test_core
>  test_freezer
> +test_kmem
> \ No newline at end of file
> diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
> index 967f268fde74..4794844a228e 100644
> --- a/tools/testing/selftests/cgroup/Makefile
> +++ b/tools/testing/selftests/cgroup/Makefile
> @@ -6,11 +6,13 @@ all:
>  TEST_FILES     := with_stress.sh
>  TEST_PROGS     := test_stress.sh
>  TEST_GEN_PROGS = test_memcontrol
> +TEST_GEN_PROGS = test_kmem

Should be +=

>  TEST_GEN_PROGS += test_core
>  TEST_GEN_PROGS += test_freezer
>  
>  include ../lib.mk
>  
>  $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
> +$(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
>  $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
>  $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
> diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> new file mode 100644
> index 000000000000..5bc1132fec6b
> --- /dev/null
> +++ b/tools/testing/selftests/cgroup/test_kmem.c
> @@ -0,0 +1,382 @@
...
> +/*
> + * This test allocates 100000 of negative dentries with long names.
> + * Then it checks that "slab" in memory.stat is larger than 1M.
> + * Then it sets memory.high to 1M and checks that at least 1/2
> + * of slab memory has been reclaimed.
> + */
> +static int test_kmem_basic(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *cg = NULL;
> +	long slab0, slab1, current;
> +
> +	cg = cg_name(root, "kmem_basic_test");
> +	if (!cg)
> +		goto cleanup;
> +
> +	if (cg_create(cg))
> +		goto cleanup;
> +
> +	if (cg_run(cg, alloc_dcache, (void *)100000))
> +		goto cleanup;
> +
> +	slab0 = cg_read_key_long(cg, "memory.stat", "slab ");
> +	if (slab0 < (1 >> 20))

1 << 20 ?

Anyway I was getting this:
not ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
not ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups

After adding some debugging to kmem_basic, I found that I get memory.stat == 0
at this point, so it fails the fixed test (with the original check it failed
the <= 0 test after writing to memory.high instead). But it's just a VM spun
up by virtme, which has a very simple init, so perhaps things are not as
initialized as expected.
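
To spell out why the shift direction matters, a small standalone sketch
(the helper names are hypothetical, not taken from the selftest):

```c
#include <stdbool.h>

/* Check as posted: (1 >> 20) evaluates to 0, so the threshold is 0. */
static bool slab_too_small_buggy(long slab)
{
	return slab < (1 >> 20);
}

/* Intended check: the threshold is 1 MiB, i.e. 1048576 bytes. */
static bool slab_too_small_fixed(long slab)
{
	return slab < (1L << 20);
}
```

With the posted check no non-negative reading can ever be flagged as too
small, so a slab value of 0 passes; with the fixed check it is rejected.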



* Re: [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-05-26 15:24   ` Vlastimil Babka
@ 2020-05-26 15:45     ` Roman Gushchin
  2020-05-27 17:00       ` Vlastimil Babka
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-26 15:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Tue, May 26, 2020 at 05:24:46PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:47 PM, Roman Gushchin wrote:
> > Add some tests to cover the kernel memory accounting functionality.
> > These are covering some issues (and changes) we had recently.
> > 
> > 1) A test which allocates a lot of negative dentries, checks memcg
> > slab statistics, creates memory pressure by setting memory.max
> > to some low value and checks that some number of slabs was reclaimed.
> > 
> > 2) A test which covers side effects of memcg destruction: it creates
> > and destroys a large number of sub-cgroups, each containing a
> > multi-threaded workload which allocates and releases some kernel
> > memory. Then it checks that the charge and memory.stats do add up
> > on the parent level.
> > 
> > 3) A test which reads /proc/kpagecgroup and implicitly checks that it
> > doesn't crash the system.
> > 
> > 4) A test which spawns a large number of threads and checks that
> > the kernel stacks accounting works as expected.
> > 
> > 5) A test which checks that living charged slab objects are not
> > preventing the memory cgroup from being released after being deleted
> > by a user.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  tools/testing/selftests/cgroup/.gitignore  |   1 +
> >  tools/testing/selftests/cgroup/Makefile    |   2 +
> >  tools/testing/selftests/cgroup/test_kmem.c | 382 +++++++++++++++++++++
> >  3 files changed, 385 insertions(+)
> >  create mode 100644 tools/testing/selftests/cgroup/test_kmem.c
> > 
> > diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
> > index aa6de65b0838..84cfcabea838 100644
> > --- a/tools/testing/selftests/cgroup/.gitignore
> > +++ b/tools/testing/selftests/cgroup/.gitignore
> > @@ -2,3 +2,4 @@
> >  test_memcontrol
> >  test_core
> >  test_freezer
> > +test_kmem
> > \ No newline at end of file
> > diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
> > index 967f268fde74..4794844a228e 100644
> > --- a/tools/testing/selftests/cgroup/Makefile
> > +++ b/tools/testing/selftests/cgroup/Makefile
> > @@ -6,11 +6,13 @@ all:
> >  TEST_FILES     := with_stress.sh
> >  TEST_PROGS     := test_stress.sh
> >  TEST_GEN_PROGS = test_memcontrol
> > +TEST_GEN_PROGS = test_kmem
> 
> Should be +=
> 
> >  TEST_GEN_PROGS += test_core
> >  TEST_GEN_PROGS += test_freezer
> >  
> >  include ../lib.mk
> >  
> >  $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
> > +$(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
> >  $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
> >  $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
> > diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
> > new file mode 100644
> > index 000000000000..5bc1132fec6b
> > --- /dev/null
> > +++ b/tools/testing/selftests/cgroup/test_kmem.c
> > @@ -0,0 +1,382 @@
> ...
> > +/*
> > + * This test allocates 100000 of negative dentries with long names.
> > + * Then it checks that "slab" in memory.stat is larger than 1M.
> > + * Then it sets memory.high to 1M and checks that at least 1/2
> > + * of slab memory has been reclaimed.
> > + */
> > +static int test_kmem_basic(const char *root)
> > +{
> > +	int ret = KSFT_FAIL;
> > +	char *cg = NULL;
> > +	long slab0, slab1, current;
> > +
> > +	cg = cg_name(root, "kmem_basic_test");
> > +	if (!cg)
> > +		goto cleanup;
> > +
> > +	if (cg_create(cg))
> > +		goto cleanup;
> > +
> > +	if (cg_run(cg, alloc_dcache, (void *)100000))
> > +		goto cleanup;
> > +
> > +	slab0 = cg_read_key_long(cg, "memory.stat", "slab ");
> > +	if (slab0 < (1 >> 20))
> 
> 1 << 20 ?
> 
> Anyway I was getting this:
> not ok 1 test_kmem_basic
> ok 2 test_kmem_memcg_deletion
> ok 3 test_kmem_proc_kpagecgroup
> not ok 4 test_kmem_kernel_stacks
> ok 5 test_kmem_dead_cgroups
> 
> Adding some debugging into kmem_basic I found I get memory.stat == 0 at this
> point thus it fails the fixed test (otherwise it was failing the <= 0 test after
> writing to memory.high). But it's just a VM spinned by virtme which has a very
> simple init, so perhaps things are not as initialized as expected.

Hm, that's strange. Do you see any values in memory.stat::slab for any cgroups?
Or can you send me your config (and kvm setup)? I'll take a look.

Btw, thank you very much for reviewing the series! I appreciate it.
I'll integrate your feedback into the next version, which I'm working on right now.

Thanks!



* Re: [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-05-22 18:27   ` Vlastimil Babka
  2020-05-23  1:32     ` Roman Gushchin
@ 2020-05-26 17:50     ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-26 17:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Fri, May 22, 2020 at 08:27:15PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > Allocate and release memory to store obj_cgroup pointers for each
> > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > to the allocated space.
> > 
> > To distinguish between obj_cgroups and memcg pointers in case
> > when it's not obvious which one is used (as in page_cgroup_ino()),
> > let's always set the lowest bit in the obj_cgroup case.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> But I have a suggestion:
> 
> ...
> 
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -191,4 +191,6 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> >  				 cache->reciprocal_size);
> >  }
> >  
> > +extern int objs_per_slab(struct kmem_cache *cache);
> > +
> >  #endif /* _LINUX_SLUB_DEF_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7f87a0eeafec..63826e460b3f 100644
> 
> ...
> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5992,4 +5992,9 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
> >  {
> >  	return -EIO;
> >  }
> > +
> > +int objs_per_slab(struct kmem_cache *cache)
> > +{
> > +	return oo_objects(cache->oo);
> > +}
> >  #endif /* CONFIG_SLUB_DEBUG */
> > 
> 
> It's somewhat unfortunate to function call just for this. Although perhaps
> > compiler can be smart enough as charge_slab_page() (that calls objs_per_slab())
> is inline and called from alloc_slab_page() which is also in mm/slub.c.
> 
> But it might be also a bit wasteful in case SLUB doesn't manage to allocate its
> desired order, but smaller. The actual number of objects is then in page->objects.
> 
> So ideally this should use something like objs_per_slab_page(cache, page) where
> SLAB supplies cache->num and SLUB page->objects, both implementations inline,
> and ignoring the other parameter?

Good idea! I'll do this in the next version. Thanks!
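
For illustration only, here is a minimal userspace sketch of the helper
suggested above. The struct definitions are simplified stand-ins, not the
kernel's: only cache->num (SLAB) and page->objects (SLUB) come from the
discussion, and the USE_SLUB switch is a hypothetical substitute for the
real Kconfig-based allocator selection.

```c
/* Simplified stand-ins for the kernel structures (assumptions of this sketch). */
struct kmem_cache {
	unsigned int num;	/* SLAB: objects per slab, fixed per cache */
};

struct page {
	unsigned int objects;	/* SLUB: objects actually placed in this slab page */
};

#define USE_SLUB 1	/* hypothetical switch; the kernel selects this via Kconfig */

#if USE_SLUB
/* SLUB variant: use the per-page count, which reflects the order actually allocated. */
static inline int objs_per_slab_page(const struct kmem_cache *cache,
				     const struct page *page)
{
	(void)cache;	/* unused in this variant */
	return (int)page->objects;
}
#else
/* SLAB variant: the count is a property of the cache. */
static inline int objs_per_slab_page(const struct kmem_cache *cache,
				     const struct page *page)
{
	(void)page;	/* unused in this variant */
	return (int)cache->num;
}
#endif
```

Both variants take the same two parameters and ignore the one they don't
need, so callers stay identical across allocators and the call can be
inlined.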



* Re: [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-05-25 15:07   ` Vlastimil Babka
@ 2020-05-26 17:53     ` Roman Gushchin
  2020-05-27 11:03       ` Vlastimil Babka
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-05-26 17:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, May 25, 2020 at 05:07:22PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > Store the obj_cgroup pointer in the corresponding place of
> > page->obj_cgroups for each allocated non-root slab object.
> > Make sure that each allocated object holds a reference to obj_cgroup.
> > 
> > Objcg pointer is obtained from the memcg->objcg dereferencing
> > in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> > post_alloc_hook. Then in case of successful allocation(s) it's
> > getting stored in the page->obj_cgroups vector.
> > 
> > The objcg obtaining part looks a bit bulky now, but it will be simplified
> > by the next commits in the series.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Nit below:
> 
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 44def57f050e..525e09e05743 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> ...
> > @@ -636,8 +684,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
> >  					 s->flags, flags);
> >  	}
> >  
> > -	if (memcg_kmem_enabled())
> > -		memcg_kmem_put_cache(s);
> > +	if (!is_root_cache(s))
> > +		memcg_slab_post_alloc_hook(s, objcg, size, p);
> >  }
> >  
> >  #ifndef CONFIG_SLOB
> 
> Keep also the memcg_kmem_enabled() static key check, like elsewhere?
> 

Ok, will add, it can speed things up a little bit. My only concern is that
the code is not ready for memcg_kmem_enabled() turning false after having been true.
But it's not a concern, right?

Actually, we can simplify memcg_kmem_enabled() mechanics and enable it
only once as soon as the first memcg is fully initialized. I don't think there
is any value in tracking the actual number of active memcgs.

Thanks!



* Re: [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-05-25 16:10   ` Vlastimil Babka
@ 2020-05-26 18:04     ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-26 18:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Mon, May 25, 2020 at 06:10:55PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > Switch to per-object accounting of non-root slab objects.
> > 
> > Charging is performed using obj_cgroup API in the pre_alloc hook.
> > Obj_cgroup is charged with the size of the object and the size
> > of metadata: as now it's the size of an obj_cgroup pointer.
> > If the amount of memory has been charged successfully, the actual
> > allocation code is executed. Otherwise, -ENOMEM is returned.
> > 
> > In the post_alloc hook if the actual allocation succeeded,
> > corresponding vmstats are bumped and the obj_cgroup pointer is saved.
> > Otherwise, the charge is canceled.
> > 
> > On the free path obj_cgroup pointer is obtained and used to uncharge
> > the size of the releasing object.
> > 
> > Memcg and lruvec counters are now representing only memory used
> > by active slab objects and do not include the free space. The free
> > space is shared and doesn't belong to any specific cgroup.
> > 
> > Global per-node slab vmstats are still modified from (un)charge_slab_page()
> > functions. The idea is to keep all slab pages accounted as slab pages
> > on system level.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Suggestion below:
> 
> > @@ -568,32 +548,33 @@ static __always_inline int charge_slab_page(struct page *page,
> >  					    gfp_t gfp, int order,
> >  					    struct kmem_cache *s)
> >  {
> > -	int ret;
> > -
> > -	if (is_root_cache(s)) {
> > -		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> > -				    PAGE_SIZE << order);
> > -		return 0;
> > -	}
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (!is_root_cache(s)) {
> 
> This could also benefit from memcg_kmem_enabled() static key test AFAICS. Maybe
> even have a wrapper for both tests together?

Added.

> 
> > +		int ret;
> >  
> > -	ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
> > -	if (ret)
> > -		return ret;
> > +		ret = memcg_alloc_page_obj_cgroups(page, gfp, objs_per_slab(s));
> 
> You created memcg_alloc_page_obj_cgroups() empty variant for !CONFIG_MEMCG_KMEM
> but now the only caller is under CONFIG_MEMCG_KMEM.

Good catch, thanks!

> 
> > +		if (ret)
> > +			return ret;
> >  
> > -	return memcg_charge_slab(page, gfp, order, s);
> > +		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
> 
> Perhaps moving this refcount into memcg_alloc_page_obj_cgroups() (maybe the name
> should be different then) will allow you to not add #ifdef CONFIG_MEMCG_KMEM in
> this function.

The reference counter bumping is not related to obj_cgroups,
we just bump a counter for each slab page belonging to the kmem_cache.
And it will go away later in the patchset with the rest of slab caches
refcounting.

> 
> Maybe this is all moot after patch 12/19, will find out :)
> 
> > +	}
> > +#endif
> > +	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> > +			    PAGE_SIZE << order);
> > +	return 0;
> >  }
> >  
> >  static __always_inline void uncharge_slab_page(struct page *page, int order,
> >  					       struct kmem_cache *s)
> >  {
> > -	if (is_root_cache(s)) {
> > -		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> > -				    -(PAGE_SIZE << order));
> > -		return;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (!is_root_cache(s)) {
> 
> Everything from above also applies here.

Done.
Thanks!

> 
> > +		memcg_free_page_obj_cgroups(page);
> > +		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
> >  	}
> > -
> > -	memcg_free_page_obj_cgroups(page);
> > -	memcg_uncharge_slab(page, order, s);
> > +#endif
> > +	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> > +			    -(PAGE_SIZE << order));
> >  }
> >  
> >  static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
> 
> 



* Re: [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches
  2020-05-26 10:52   ` Vlastimil Babka
@ 2020-05-26 18:50     ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-26 18:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Tue, May 26, 2020 at 12:52:24PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:47 PM, Roman Gushchin wrote:
> > Currently there are two lists of kmem_caches:
> > 1) slab_caches, which contains all kmem_caches,
> > 2) slab_root_caches, which contains only root kmem_caches.
> > 
> > And there is some preprocessor magic to have a single list
> > if CONFIG_MEMCG_KMEM isn't enabled.
> > 
> > It was required earlier because the number of non-root kmem_caches
> > was proportional to the number of memory cgroups and could reach
> > really big values. Now, when it cannot exceed the number of root
> > kmem_caches, there is really no reason to maintain two lists.
> > 
> > We never iterate over the slab_root_caches list on any hot paths,
> > so it's perfectly fine to iterate over slab_caches and filter out
> > non-root kmem_caches.
> > 
> > It allows to remove a lot of config-dependent code and two pointers
> > from the kmem_cache structure.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> > @@ -1148,11 +1126,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
> >  
> >  static int slab_show(struct seq_file *m, void *p)
> >  {
> > -	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
> > +	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
> >  
> > -	if (p == slab_root_caches.next)
> > +	if (p == slab_caches.next)
> >  		print_slabinfo_header(m);
> > -	cache_show(s, m);
> > +	if (is_root_cache(s))
> > +		cache_show(s, m);
> 
> If there wasn't patch 17/19 we could just remove this condition and have
> /proc/slabinfo contain the -memcg variants?

Sure, it's an option too. But because it's a user-facing interface, I'd keep it
as it is now, at least until everything settles down a bit.
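
For reference, the filtering done in the quoted slab_show() hunk can be
modeled in userspace roughly as follows. This is only a sketch: the struct
layout, the singly linked `next` pointer (standing in for the kernel's
list_head) and count_shown_caches() are hypothetical and exist only for
this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct kmem_cache (assumption of this sketch). */
struct kmem_cache {
	const char *name;
	struct kmem_cache *root_cache;	/* NULL for a root cache */
	struct kmem_cache *next;	/* stand-in for the list_head linkage */
};

static inline bool is_root_cache(const struct kmem_cache *s)
{
	return s->root_cache == NULL;
}

/* Walk the single list of all caches and count those that would be shown. */
static size_t count_shown_caches(const struct kmem_cache *head)
{
	size_t shown = 0;

	for (const struct kmem_cache *s = head; s; s = s->next)
		if (is_root_cache(s))
			shown++;
	return shown;
}
```

Dropping the is_root_cache() test in the loop is what would make the
non-root caches visible as well.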

Thanks!



* Re: [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-05-26 14:55   ` Vlastimil Babka
@ 2020-05-27  8:35     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 84+ messages in thread
From: Jesper Dangaard Brouer @ 2020-05-27  8:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Michal Hocko,
	linux-mm, kernel-team, linux-kernel, Mel Gorman, brouer

On Tue, 26 May 2020 16:55:05 +0200
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 4/22/20 10:47 PM, Roman Gushchin wrote:
> > Instead of having two sets of kmem_caches: one for system-wide and
> > non-accounted allocations and the second one shared by all accounted
> > allocations, we can use just one.
> > 
> > The idea is simple: space for obj_cgroup metadata can be allocated
> > on demand and filled only for accounted allocations.
> > 
> > It allows to remove a bunch of code which is required to handle
> > kmem_cache clones for accounted allocations. There is no more need
> > to create them, accumulate statistics, propagate attributes, etc.
> > It's a quite significant simplification.
> > 
> > Also, because the total number of slab_caches is reduced almost twice
> > (not all kmem_caches have a memcg clone), some additional memory
> > savings are expected. On my devvm it additionally saves about 3.5%
> > of slab memory.
> > 
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>  
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> However, as this series will affect slab fastpaths, and perhaps
> especially this patch will affect even non-kmemcg allocations being
> freed, I'm CCing Jesper and Mel for awareness as they AFAIK did work
> on network stack memory management performance, and perhaps some
> benchmarks are in order...

Thanks for the heads-up! 

We (should) all know Mel Gorman's tests, which are here [1]:
 [1] https://github.com/gormanm/mmtests

My guess is that these changes will only be visible with micro
benchmarks of slub/slab.  My slab/slub micro benchmarks are
located here [2]:
 [2] https://github.com/netoptimizer/prototype-kernel/

They are kernel modules that are compiled against your devel tree and pushed
to the remote host.  Results are simply printk'ed to dmesg.
The compile+push commands are documented here [3]:
 [3] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

I recommend trying: "slab_bulk_test01"
 modprobe slab_bulk_test01; rmmod slab_bulk_test01
 dmesg

Results from these kernel module benchmarks are included in some
commits [4][5]. And in [4] I found some overhead caused by MEMCG.

 [4] https://git.kernel.org/torvalds/c/ca257195511d
 [5] https://git.kernel.org/torvalds/c/fbd02630c6e3
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer




* Re: [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-05-26 17:53     ` Roman Gushchin
@ 2020-05-27 11:03       ` Vlastimil Babka
  0 siblings, 0 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-27 11:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On 5/26/20 7:53 PM, Roman Gushchin wrote:
> On Mon, May 25, 2020 at 05:07:22PM +0200, Vlastimil Babka wrote:
>> On 4/22/20 10:46 PM, Roman Gushchin wrote:
>> > diff --git a/mm/slab.h b/mm/slab.h
>> > index 44def57f050e..525e09e05743 100644
>> > --- a/mm/slab.h
>> > +++ b/mm/slab.h
>> ...
>> > @@ -636,8 +684,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
>> >  					 s->flags, flags);
>> >  	}
>> >  
>> > -	if (memcg_kmem_enabled())
>> > -		memcg_kmem_put_cache(s);
>> > +	if (!is_root_cache(s))
>> > +		memcg_slab_post_alloc_hook(s, objcg, size, p);
>> >  }
>> >  
>> >  #ifndef CONFIG_SLOB
>> 
>> Keep also the memcg_kmem_enabled() static key check, like elsewhere?
>> 
> 
> Ok, will add, it can speed things up a little bit. My only concern is that
> the code is not ready for memcg_kmem_enabled() turning false after having been true.
> But it's not a concern, right?
> 
> Actually, we can simplify memcg_kmem_enabled() mechanics and enable it
> only once as soon as the first memcg is fully initialized. I don't think there
> is any value in tracking the actual number of active memcgs.

Yeah, it should be acceptable that once the key is enabled after boot, there's
no way back until reboot.
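
A rough userspace model of the "enable once, never disable" behaviour
discussed above, as a sketch only. The kernel uses a static key (static
branch) rather than an atomic flag, and the names kmem_enabled_key,
kmem_enabled() and kmem_enable_once() are hypothetical, chosen just for
this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the static key; starts out disabled (zero-initialized). */
static atomic_bool kmem_enabled_key;

/* Fast-path check, analogous in spirit to memcg_kmem_enabled(). */
static inline bool kmem_enabled(void)
{
	return atomic_load_explicit(&kmem_enabled_key, memory_order_acquire);
}

/*
 * Called once the first memcg is fully initialized. Idempotent: only the
 * first call flips the flag and returns true. There is deliberately no
 * counterpart that turns the flag off again.
 */
static inline bool kmem_enable_once(void)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&kmem_enabled_key, &expected, true);
}
```

Because nothing ever clears the flag, code guarded by the check does not
need to handle the enabled-then-disabled transition, and no count of
active memcgs has to be maintained.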

> Thanks!
> 




* Re: [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-05-26 15:45     ` Roman Gushchin
@ 2020-05-27 17:00       ` Vlastimil Babka
  2020-05-27 20:45         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Vlastimil Babka @ 2020-05-27 17:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel


On 5/26/20 5:45 PM, Roman Gushchin wrote:
> On Tue, May 26, 2020 at 05:24:46PM +0200, Vlastimil Babka wrote:
>> 1 << 20 ?
>> 
>> Anyway I was getting this:
>> not ok 1 test_kmem_basic
>> ok 2 test_kmem_memcg_deletion
>> ok 3 test_kmem_proc_kpagecgroup
>> not ok 4 test_kmem_kernel_stacks
>> ok 5 test_kmem_dead_cgroups
>> 
>> Adding some debugging into kmem_basic I found I get memory.stat == 0 at this
>> point thus it fails the fixed test (otherwise it was failing the <= 0 test after
>> writing to memory.high). But it's just a VM spun up by virtme which has a very
>> simple init, so perhaps things are not as initialized as expected.
> 
> Hm, it's strange, do you have any values in memory.stat::slab for any cgroups?
> Or can you send me your config (and kvm setup), I'll take a look.

Config is attached, KVM setup is just running virtme [1] on the git tree where I
compiled the kernel:

virtme-run --mods=auto --kdir /path/to/linux.git/

So there's only the cgroup the test creates, and memory.stat::slab is zero
after alloc_dcache(), although I can see by strace that it does all those
stat()'s. Also, /proc/slabinfo doesn't seem to show an increasing number of
dentries. Could that be because the root fs is 9p in virtme?

[1] https://github.com/amluto/virtme

> Btw, thank you very much for reviewing the series! I appreciate it.
> I'll integrate your feedback into the next version, which I'm working on right now.
> 
> Thanks!
> 


[-- Attachment #2: .config --]
[-- Type: application/x-config, Size: 145689 bytes --]


* Re: [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-05-27 17:00       ` Vlastimil Babka
@ 2020-05-27 20:45         ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-05-27 20:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, linux-mm,
	kernel-team, linux-kernel

On Wed, May 27, 2020 at 07:00:30PM +0200, Vlastimil Babka wrote:
> On 5/26/20 5:45 PM, Roman Gushchin wrote:
> > On Tue, May 26, 2020 at 05:24:46PM +0200, Vlastimil Babka wrote:
> >> 1 << 20 ?
> >> 
> >> Anyway I was getting this:
> >> not ok 1 test_kmem_basic
> >> ok 2 test_kmem_memcg_deletion
> >> ok 3 test_kmem_proc_kpagecgroup
> >> not ok 4 test_kmem_kernel_stacks
> >> ok 5 test_kmem_dead_cgroups
> >> 
> >> Adding some debugging into kmem_basic I found I get memory.stat == 0 at this
> >> point thus it fails the fixed test (otherwise it was failing the <= 0 test after
> >> writing to memory.high). But it's just a VM spun up by virtme which has a very
> >> simple init, so perhaps things are not as initialized as expected.
> > 
> > Hm, it's strange, do you have any values in memory.stat::slab for any cgroups?
> > Or can you send me your config (and kvm setup), I'll take a look.
> 
> Config is attached, KVM setup is just running virtme [1] on the git tree where I
> compiled the kernel:
> 
> virtme-run --mods=auto --kdir /path/to/linux.git/

Thanks!

So, test_kmem_kernel_stacks fails because there is not enough memory in your setup:
it tries to spawn 1000 threads, and virtme-run sets the total memory to 100M,
which is not enough. Adding the --memory 4G option makes the test pass.

> 
> So there's only the cgroup the test creates, and memory.stat::slab is zero
> after alloc_dcache(), although I can see by strace that it does all those
> stat()'s. Also, /proc/slabinfo doesn't seem to show an increasing number of
> dentries. Could that be because the root fs is 9p in virtme?

The basic test seems to fail because of 9p. It also fails on the unpatched
kernel, so it's definitely not a regression.

So, maybe it's better to allocate kernel memory in the test using a different
method? I'm open to any ideas here.

Thanks!



Thread overview: 84+ messages
2020-04-22 20:46 [PATCH v3 00/19] The new cgroup slab memory controller Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
2020-05-07 20:33   ` Johannes Weiner
2020-05-20 10:49   ` Vlastimil Babka
2020-04-22 20:46 ` [PATCH v3 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
2020-05-07 20:34   ` Johannes Weiner
2020-05-20 11:31   ` Vlastimil Babka
2020-05-20 11:36     ` Vlastimil Babka
2020-04-22 20:46 ` [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
2020-05-07 20:41   ` Johannes Weiner
2020-05-20 12:25   ` Vlastimil Babka
2020-05-20 19:26     ` Roman Gushchin
2020-05-21  9:57       ` Vlastimil Babka
2020-05-21 21:14         ` Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2020-04-22 23:52   ` Christopher Lameter
2020-04-23  0:05     ` Roman Gushchin
2020-04-25  2:10       ` Christopher Lameter
2020-04-25  2:46         ` Roman Gushchin
2020-04-27 16:21           ` Christopher Lameter
2020-04-27 16:46             ` Roman Gushchin
2020-04-28 17:06               ` Roman Gushchin
2020-04-28 17:45               ` Johannes Weiner
2020-04-30 16:29               ` Christopher Lameter
2020-04-30 17:15                 ` Roman Gushchin
2020-05-02 23:54                   ` Christopher Lameter
2020-05-04 18:29                     ` Roman Gushchin
2020-05-08 21:35                       ` Christopher Lameter
2020-05-13  0:57                         ` Roman Gushchin
2020-05-15 21:45                           ` Christopher Lameter
2020-05-15 22:12                             ` Roman Gushchin
2020-05-20  9:51                           ` Vlastimil Babka
2020-05-20 20:57                             ` Roman Gushchin
2020-05-15 20:02                         ` Roman Gushchin
2020-04-23 21:01     ` Roman Gushchin
2020-04-25  2:10       ` Christopher Lameter
2020-05-20 13:51   ` Vlastimil Babka
2020-05-20 21:00     ` Roman Gushchin
2020-05-21 11:01       ` Vlastimil Babka
2020-05-21 21:06         ` Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
2020-05-07 21:03   ` Johannes Weiner
2020-05-07 22:26     ` Roman Gushchin
2020-05-12 22:56       ` Johannes Weiner
2020-05-15 22:01         ` Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
2020-04-23 20:20   ` Roman Gushchin
2020-05-22 18:27   ` Vlastimil Babka
2020-05-23  1:32     ` Roman Gushchin
2020-05-26 17:50     ` Roman Gushchin
2020-05-25 14:46   ` Vlastimil Babka
2020-04-22 20:46 ` [PATCH v3 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
2020-05-25 15:07   ` Vlastimil Babka
2020-05-26 17:53     ` Roman Gushchin
2020-05-27 11:03       ` Vlastimil Babka
2020-04-22 20:46 ` [PATCH v3 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
2020-05-25 16:10   ` Vlastimil Babka
2020-05-26 18:04     ` Roman Gushchin
2020-04-22 20:46 ` [PATCH v3 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
2020-05-07 21:05   ` Johannes Weiner
2020-04-22 20:47 ` [PATCH v3 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2020-05-25 17:03   ` Vlastimil Babka
2020-04-22 20:47 ` [PATCH v3 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
2020-05-26 10:12   ` Vlastimil Babka
2020-04-22 20:47 ` [PATCH v3 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
2020-05-26 10:31   ` Vlastimil Babka
2020-04-22 20:47 ` [PATCH v3 14/19] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
2020-05-26 10:34   ` Vlastimil Babka
2020-04-22 20:47 ` [PATCH v3 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
2020-05-26 10:52   ` Vlastimil Babka
2020-05-26 18:50     ` Roman Gushchin
2020-04-22 20:47 ` [PATCH v3 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
2020-05-26 11:31   ` Vlastimil Babka
2020-04-22 20:47 ` [PATCH v3 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
2020-05-26 14:55   ` Vlastimil Babka
2020-05-27  8:35     ` Jesper Dangaard Brouer
2020-04-22 20:47 ` [PATCH v3 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
2020-05-26 15:24   ` Vlastimil Babka
2020-05-26 15:45     ` Roman Gushchin
2020-05-27 17:00       ` Vlastimil Babka
2020-05-27 20:45         ` Roman Gushchin
2020-04-22 20:47 ` [PATCH v3 19/19] tools/cgroup: add memcg_slabinfo.py tool Roman Gushchin
2020-05-05 15:59   ` Tejun Heo
