LKML Archive on lore.kernel.org
* [PATCH v6 00/19] The new cgroup slab memory controller
@ 2020-06-08 23:06 Roman Gushchin
  2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
                   ` (20 more replies)
  0 siblings, 21 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

This is v6 of the slab cgroup controller rework.

The patchset moves the accounting from the page level to the object
level. This allows slab pages to be shared between memory cgroups,
which leads to a significant win in slab utilization (up to 45%)
and a corresponding drop in the total kernel memory footprint.
The reduced number of unmovable slab pages should also have a positive
effect on the memory fragmentation.
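
To illustrate where the utilization win comes from, here is a minimal
userspace sketch (not code from the patchset). It compares the number of
slab pages needed when every cgroup fills its own private pages with the
number needed when objects from all cgroups share pages. All function
names below are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed to hold nr_objs objects at objs_per_page objects per page. */
static size_t pages_for(size_t nr_objs, size_t objs_per_page)
{
	assert(objs_per_page > 0);
	return (nr_objs + objs_per_page - 1) / objs_per_page;
}

/* Page-level accounting: every cgroup fills its own private slab pages. */
static size_t pages_per_cgroup_caches(const size_t *objs, size_t nr_cgroups,
				      size_t objs_per_page)
{
	size_t i, pages = 0;

	for (i = 0; i < nr_cgroups; i++)
		pages += pages_for(objs[i], objs_per_page);
	return pages;
}

/* Object-level accounting: objects from all cgroups share the same pages. */
static size_t pages_shared_caches(const size_t *objs, size_t nr_cgroups,
				  size_t objs_per_page)
{
	size_t i, total = 0;

	for (i = 0; i < nr_cgroups; i++)
		total += objs[i];
	return pages_for(total, objs_per_page);
}
```

With made-up numbers: four cgroups holding 3, 5, 2 and 6 objects of a cache
that fits 16 objects per page need four pages with private caches, but a
single page when the pages are shared.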

The patchset makes the slab accounting code simpler: there is no longer
any need for the complicated dynamic creation and destruction of per-cgroup
slab caches, because all memory cgroups use a global set of shared slab
caches. The lifetime of slab caches is no longer tied to the lifetime
of memory cgroups.

The more precise accounting does require more CPU; in practice, however,
the difference seems to be negligible. We've been using the new slab
controller in Facebook production for several months with different
workloads and haven't seen any noticeable regressions. What we have seen
are memory savings on the order of 1 GB per host (this varied heavily depending
on the actual workload, size of RAM, number of CPUs, memory pressure, etc).

The third version of the patchset added yet another step towards
the simplification of the code: sharing of slab caches between
accounted and non-accounted allocations. It comes with significant
upsides (most notably, the complete elimination of dynamic slab cache
creation) but not without some regression risks, so this change sits
on top of the patchset and is not completely merged in. In the unlikely
event of a noticeable performance regression it can be reverted separately.

v6:
  1) rebased on top of the mm tree
  2) removed a redundant check from cache_from_obj(), suggested by Vlastimil

v5:
  1) fixed a build error, spotted by Vlastimil
  2) added a comment about memcg->nr_charged_bytes, asked by Johannes
  3) added missed acks and reviews

v4:
  1) rebased on top of the mm tree, some fixes here and there
  2) merged obj_to_index() with slab_index(), suggested by Vlastimil
  3) changed objects_per_slab() to a better objects_per_slab_page(),
     suggested by Vlastimil
  4) other minor fixes and changes

v3:
  1) added a patch that switches to a global single set of kmem_caches
  2) kmem API clean up dropped, because it has already been merged
  3) byte-sized slab vmstat API over page-sized global counters and
     byte-sized memcg/lruvec counters
  4) obj_cgroup refcounting simplifications and other minor fixes
  5) other minor changes

v2:
  1) implemented re-layering and renaming suggested by Johannes,
     added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (18):
  mm: memcg: factor out memcg- and lruvec-level changes out of
    __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all accounted
    allocations
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: remove memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  mm: memcg/slab: use a single set of kmem_caches for all allocations
  kselftests: cgroup: add kernel memory accounting tests
  tools/cgroup: add memcg_slabinfo.py tool

 drivers/base/node.c                        |   6 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 |  85 ++-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  24 +-
 include/linux/slab.h                       |   5 -
 include/linux/slab_def.h                   |   9 +-
 include/linux/slub_def.h                   |  31 +-
 include/linux/vmstat.h                     |  14 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/memcontrol.c                            | 608 +++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |   8 +-
 mm/slab.c                                  |  70 +--
 mm/slab.h                                  | 372 +++++-------
 mm/slab_common.c                           | 643 +--------------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  | 229 +-------
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  30 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/memcg_slabinfo.py             | 226 ++++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
 25 files changed, 1374 insertions(+), 1405 deletions(-)
 create mode 100755 tools/cgroup/memcg_slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

-- 
2.25.4



* [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17  1:52   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

To convert memcg and lruvec slab counters to bytes there must be
a way to change these counters without touching node counters.
Factor __mod_memcg_lruvec_state() out of __mod_lruvec_state().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h | 17 +++++++++++++++
 mm/memcontrol.c            | 43 +++++++++++++++++++++-----------------
 2 files changed, 41 insertions(+), 19 deletions(-)
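
For readers skimming the diff, the split described in the commit message
can be modeled in userspace roughly as follows. This is only a sketch: the
struct and function names are hypothetical stand-ins, and the real code
additionally batches updates through per-cpu counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the three counters involved. */
struct counters {
	long node;		/* per-node counter */
	long memcg;		/* per-memcg counter */
	long lruvec;		/* per-lruvec counter */
	bool memcg_disabled;	/* models mem_cgroup_disabled() */
};

/* Models __mod_memcg_lruvec_state(): touches memcg and lruvec only. */
static void mod_memcg_lruvec(struct counters *c, long val)
{
	c->memcg += val;
	c->lruvec += val;
}

/* Models __mod_lruvec_state(): node first, then memcg/lruvec if enabled. */
static void mod_lruvec(struct counters *c, long val)
{
	assert(c);
	c->node += val;
	if (!c->memcg_disabled)
		mod_memcg_lruvec(c, val);
}
```

The point of the factoring is that a caller can now use the first helper to
change the memcg and lruvec counters while leaving the node counter alone.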

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bbf624a7f5a6..93dbc7f9d8b8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val);
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
 void mod_memcg_obj_state(void *p, int idx, int val);
 
+static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_memcg_lruvec_state(lruvec, idx, val);
+	local_irq_restore(flags);
+}
+
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
 {
@@ -1057,6 +1069,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
+static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+					    enum node_stat_item idx, int val)
+{
+}
+
 static inline void __mod_lruvec_state(struct lruvec *lruvec,
 				      enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bbda19c13c19..e8a91e98556b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -713,30 +713,13 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 	return mem_cgroup_nodeinfo(parent, nid);
 }
 
-/**
- * __mod_lruvec_state - update lruvec memory statistics
- * @lruvec: the lruvec
- * @idx: the stat item
- * @val: delta to add to the counter, can be negative
- *
- * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
- */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			int val)
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val)
 {
-	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
 	long x;
 
-	/* Update node */
-	__mod_node_page_state(pgdat, idx, val);
-
-	if (mem_cgroup_disabled())
-		return;
-
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
@@ -748,6 +731,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
 	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+		pg_data_t *pgdat = lruvec_pgdat(lruvec);
 		struct mem_cgroup_per_node *pi;
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -757,6 +741,27 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			int val)
+{
+	/* Update node */
+	__mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+
+	/* Update memcg and lruvec */
+	if (!mem_cgroup_disabled())
+		__mod_memcg_lruvec_state(lruvec, idx, val);
+}
+
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
 	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
-- 
2.25.4



* [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
  2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17  2:57   ` Shakeel Butt
  2020-06-17 15:55   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
                   ` (18 subsequent siblings)
  20 siblings, 2 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

To implement per-object slab memory accounting, we need to
convert slab vmstat counters to bytes. Actually, out of the
4 levels of counters (global, per-node, per-memcg and per-lruvec),
only the last two levels will require byte-sized counters.
That is because global and per-node counters will be counting the
number of slab pages, while per-memcg and per-lruvec counters will be
counting the amount of memory taken by charged slab objects.

Converting all vmstat counters to bytes or even all slab
counters to bytes would introduce an additional overhead.
So instead let's store global and per-node counters
in pages, and memcg and lruvec counters in bytes.

To keep the API clean, all access helpers (on both the read
and write sides) deal with bytes.

To avoid back-and-forth conversions, a new flavor of read-side
helpers is introduced, which always returns values in pages:
node_page_state_pages() and global_node_page_state_pages().

The new helpers simply read the raw values. The old helpers are
simple wrappers, which will complain on an attempt to read a
byte-sized value, because at the moment no one actually needs bytes.

Thanks to Johannes Weiner for the idea of having the byte-sized API
on top of the page-sized internal storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    |  2 +-
 include/linux/mmzone.h | 10 ++++++++++
 include/linux/vmstat.h | 14 +++++++++++++-
 mm/memcontrol.c        | 14 ++++++++++----
 mm/vmstat.c            | 30 ++++++++++++++++++++++++++----
 5 files changed, 60 insertions(+), 10 deletions(-)
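
The "byte-sized API on top of page-sized storage" scheme from the commit
message can be sketched in userspace as below. Everything prefixed with
sketch_ or SKETCH_ is hypothetical: the 4 KiB page size and the batch
value of 32 are assumptions made for the example, and the per-cpu
machinery of the real helpers is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1L << SKETCH_PAGE_SHIFT)
#define SKETCH_BATCH      32L	/* assumed stand-in for MEMCG_CHARGE_BATCH */

enum sketch_item { SKETCH_ITEM_PAGES, SKETCH_ITEM_BYTES, SKETCH_NR_ITEMS };

static long sketch_node_stat[SKETCH_NR_ITEMS];	/* always stored in pages */

/* Models vmstat_item_in_bytes(): does the API for this item take bytes? */
static bool sketch_item_in_bytes(enum sketch_item item)
{
	return item == SKETCH_ITEM_BYTES;
}

/*
 * Write side: byte deltas must be page-aligned and are stored as pages.
 * The patch uses "delta >>= PAGE_SHIFT"; division is used here so that
 * negative deltas stay well defined in portable C.
 */
static void sketch_mod_node_state(enum sketch_item item, long delta)
{
	if (sketch_item_in_bytes(item)) {
		assert(delta % SKETCH_PAGE_SIZE == 0);
		delta /= SKETCH_PAGE_SIZE;
	}
	sketch_node_stat[item] += delta;
}

/* Read side, "_pages" flavor: returns the raw stored value, in pages. */
static long sketch_node_state_pages(enum sketch_item item)
{
	return sketch_node_stat[item];
}

/* memcg/lruvec side: the batch threshold is scaled up for byte items. */
static long sketch_batch_threshold(enum sketch_item item)
{
	long threshold = SKETCH_BATCH;

	if (sketch_item_in_bytes(item))
		threshold <<= SKETCH_PAGE_SHIFT;
	return threshold;
}
```

Scaling the threshold keeps the flush frequency of a byte-sized item
comparable to that of a page-sized one for the same amount of memory.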

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5b02f69769e8..e21e31359297 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -513,7 +513,7 @@ static ssize_t node_read_vmstat(struct device *dev,
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
-			     node_page_state(pgdat, i));
+			     node_page_state_pages(pgdat, i));
 
 	return n;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..fa8eb49d9898 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -206,6 +206,16 @@ enum node_stat_item {
 	NR_VM_NODE_STAT_ITEMS
 };
 
+/*
+ * Returns true if the value is measured in bytes (most vmstat values are
+ * measured in pages). This defines the API part, the internal representation
+ * might be different.
+ */
+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+	return false;
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index aa961088c551..91220ace31da 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -8,6 +8,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/atomic.h>
 #include <linux/static_key.h>
+#include <linux/mmdebug.h>
 
 extern int sysctl_stat_interval;
 
@@ -192,7 +193,8 @@ static inline unsigned long global_zone_page_state(enum zone_stat_item item)
 	return x;
 }
 
-static inline unsigned long global_node_page_state(enum node_stat_item item)
+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
 {
 	long x = atomic_long_read(&vm_node_stat[item]);
 #ifdef CONFIG_SMP
@@ -202,6 +204,13 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return x;
 }
 
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
+	return global_node_page_state_pages(item);
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -242,9 +251,12 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+					   enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
+#define node_page_state_pages(node, item) global_node_page_state_pages(item)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8a91e98556b..07d02e61a73e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -681,13 +681,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	if (mem_cgroup_disabled())
 		return;
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup *mi;
 
 		/*
@@ -718,7 +721,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 {
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
@@ -729,8 +732,11 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		pg_data_t *pgdat = lruvec_pgdat(lruvec);
 		struct mem_cgroup_per_node *pi;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 80c9b6221535..f1c321e1d6d3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -341,6 +341,11 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 	long x;
 	long t;
 
+	if (vmstat_item_in_bytes(item)) {
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
@@ -398,6 +403,8 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	s8 v, t;
 
+	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
@@ -442,6 +449,8 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	s8 v, t;
 
+	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
@@ -541,6 +550,11 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 	s8 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
+	if (vmstat_item_in_bytes(item)) {
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
 	do {
 		z = 0;  /* overflow to node counters */
 
@@ -989,8 +1003,8 @@ unsigned long sum_zone_numa_state(int node,
 /*
  * Determine the per node value of a stat item.
  */
-unsigned long node_page_state(struct pglist_data *pgdat,
-				enum node_stat_item item)
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+				    enum node_stat_item item)
 {
 	long x = atomic_long_read(&pgdat->vm_stat[item]);
 #ifdef CONFIG_SMP
@@ -999,6 +1013,14 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 #endif
 	return x;
 }
+
+unsigned long node_page_state(struct pglist_data *pgdat,
+			      enum node_stat_item item)
+{
+	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
+
+	return node_page_state_pages(pgdat, item);
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1581,7 +1603,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
-				   node_page_state(pgdat, i));
+				   node_page_state_pages(pgdat, i));
 		}
 	}
 	seq_printf(m,
@@ -1702,7 +1724,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 #endif
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		v[i] = global_node_page_state(i);
+		v[i] = global_node_page_state_pages(i);
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
-- 
2.25.4



* [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
  2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
  2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17  3:03   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally, global and per-node counters are stored in pages,
whereas memcg and lruvec counters are stored in bytes.
This scheme may look weird, but only for now. As soon as slab
pages are shared between multiple cgroups, global and
node counters will reflect the total number of slab pages,
whereas memcg and lruvec counters will be used for per-memcg
slab memory tracking, which will take individual kernel objects
into account. Keeping global and node counters in pages helps
to avoid additional overhead.

The size of slab memory shouldn't exceed 4 GB on 32-bit machines,
so it will fit into the atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c     |  4 ++--
 fs/proc/meminfo.c       |  4 ++--
 include/linux/mmzone.h  | 16 +++++++++++++---
 kernel/power/snapshot.c |  2 +-
 mm/memcontrol.c         | 11 ++++-------
 mm/oom_kill.c           |  2 +-
 mm/page_alloc.c         |  8 ++++----
 mm/slab.h               | 15 ++++++++-------
 mm/slab_common.c        |  4 ++--
 mm/slob.c               | 12 ++++++------
 mm/slub.c               |  8 ++++----
 mm/vmscan.c             |  3 ++-
 mm/workingset.c         |  6 ++++--
 13 files changed, 53 insertions(+), 42 deletions(-)
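
As a rough userspace model of what the conversion means for callers (a
sketch only: the demo_ names are hypothetical, 4 KiB pages are assumed, and
it models the case where a slab page is charged to a non-root memcg):
callers pass "PAGE_SIZE << order" bytes, the node counter still ends up in
pages, and the memcg value is reported without a PAGE_SIZE multiplication.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1L << DEMO_PAGE_SHIFT)

struct demo_slab_stats {
	long node_pages;	/* global/per-node counter, stored in pages */
	long memcg_bytes;	/* per-memcg/lruvec counter, stored in bytes */
};

/* Bytes covered by a slab page of the given order. */
static long demo_slab_bytes(unsigned int order)
{
	/* Guard against shifting past the width of long. */
	assert(order < 8 * sizeof(long) - DEMO_PAGE_SHIFT - 1);
	return DEMO_PAGE_SIZE << order;
}

/* Account a newly allocated slab page; the delta is given in bytes. */
static void demo_charge_slab_page(struct demo_slab_stats *s, unsigned int order)
{
	long bytes = demo_slab_bytes(order);

	s->node_pages += bytes / DEMO_PAGE_SIZE;
	s->memcg_bytes += bytes;
}

/* Undo the accounting when the slab page is freed. */
static void demo_uncharge_slab_page(struct demo_slab_stats *s,
				    unsigned int order)
{
	long bytes = demo_slab_bytes(order);

	s->node_pages -= bytes / DEMO_PAGE_SIZE;
	s->memcg_bytes -= bytes;
}

/* memory.stat "slab" value: already in bytes, so no PAGE_SIZE multiply. */
static unsigned long long demo_memory_stat_slab(const struct demo_slab_stats *s)
{
	return (unsigned long long)s->memcg_bytes;
}
```

At this point in the series both counters still move in whole pages; the
byte granularity only starts to matter once individual objects are charged
by the later patches.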

diff --git a/drivers/base/node.c b/drivers/base/node.c
index e21e31359297..0cf13e31603c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
-	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
-	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+	sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index e9a6841fc25b..38ea95fd919a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -52,8 +52,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
-	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
-	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fa8eb49d9898..851c14f0c1b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -174,8 +174,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,
-	NR_SLAB_UNRECLAIMABLE,
+	NR_SLAB_RECLAIMABLE_B,
+	NR_SLAB_UNRECLAIMABLE_B,
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
@@ -213,7 +213,17 @@ enum node_stat_item {
  */
 static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
 {
-	return false;
+	/*
+	 * Global and per-node slab counters track slab pages.
+	 * It's expected that changes are multiples of PAGE_SIZE.
+	 * Internally values are stored in pages.
+	 *
+	 * Per-memcg and per-lruvec counters track memory, consumed
+	 * by individual slab objects. These counters are actually
+	 * byte-precise.
+	 */
+	return (item == NR_SLAB_RECLAIMABLE_B ||
+		item == NR_SLAB_UNRECLAIMABLE_B);
 }
 
 /*
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 881128b9351e..eefc907e5324 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1663,7 +1663,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
 {
 	unsigned long size;
 
-	size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+	size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
 		+ global_node_page_state(NR_ACTIVE_ANON)
 		+ global_node_page_state(NR_INACTIVE_ANON)
 		+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07d02e61a73e..d18bf93e0f19 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1391,9 +1391,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
 		       1024);
 	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
-		       PAGE_SIZE);
+		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
@@ -1423,11 +1422,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 			       PAGE_SIZE);
 
 	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
 
 	/* Accumulated memory events */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962893ee..d30ce75f23fb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,7 +184,7 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
-	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2163698b9e6..c28a6579d78e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5232,8 +5232,8 @@ long si_mem_available(void)
 	 * items that are in use, and cannot be freed. Cap this estimate at the
 	 * low watermark.
 	 */
-	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
-			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
@@ -5376,8 +5376,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_UNEVICTABLE),
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
-		global_node_page_state(NR_SLAB_RECLAIMABLE),
-		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index 815e4e9a94cd..633eedb6bad1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 static inline int cache_vmstat_idx(struct kmem_cache *s)
 {
 	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+		NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -361,7 +361,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages);
+				    nr_pages << PAGE_SHIFT);
 		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
 		return 0;
 	}
@@ -371,7 +371,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
@@ -396,11 +396,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+				 -(nr_pages << PAGE_SHIFT));
 		memcg_kmem_uncharge(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -nr_pages);
+				    -(nr_pages << PAGE_SHIFT));
 	}
 	rcu_read_unlock();
 
@@ -484,7 +485,7 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    1 << order);
+				    PAGE_SIZE << order);
 		return 0;
 	}
 
@@ -496,7 +497,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 		return;
 	}
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..b578ae29c743 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1325,8 +1325,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	page = alloc_pages(flags, order);
 	if (likely(page)) {
 		ret = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 	ret = kasan_kmalloc_large(ret, size, flags);
 	/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index ac2aecfbc7a8..7cc9805c8091 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
-	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-			    1 << order);
+	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+			    PAGE_SIZE << order);
 	return page_address(page);
 }
 
@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
 
-	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-			    -(1 << order));
+	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
 }
 
@@ -552,8 +552,8 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else {
 		unsigned int order = compound_order(sp);
-		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(sp, order);
 
 	}
diff --git a/mm/slub.c b/mm/slub.c
index b8f798b50d44..354e475db5ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3923,8 +3923,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	page = alloc_pages_node(node, flags, order);
 	if (page) {
 		ptr = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 
 	return kmalloc_large_node_hook(ptr, size, flags);
@@ -4055,8 +4055,8 @@ void kfree(const void *x)
 
 		BUG_ON(!PageCompound(page));
 		kfree_hook(object);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(page, order);
 		return;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..e1d96242e9f9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4219,7 +4219,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * unmapped file backed pages.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+	    pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..9bf12523e3f0 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -478,8 +478,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.25.4



* [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17  3:08   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

This commit implements the SLUB version of the obj_to_index() function,
which will be required to calculate the offset of an obj_cgroup in the
obj_cgroups vector to store/obtain the objcg ownership data.

To make it faster, let's repeat the SLAB's trick introduced by
commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
divide in obj_to_index()") and avoid an expensive division.

Vlastimil Babka noticed that SLUB already has a similar function
called slab_index(), which is defined only if SLUB_DEBUG is enabled.
The function does similar math, but with a division, and it also
takes a page address instead of a page pointer.

Let's remove slab_index() and replace it with the new helper
__obj_to_index(), which takes a page address. obj_to_index()
will be a simple wrapper taking a page pointer and passing
page_address(page) into __obj_to_index().
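
The multiply-instead-of-divide trick described above can be sketched in
userspace as follows. This is a minimal model, assuming the same
multiplier-and-shift scheme as the kernel's reciprocal_div helpers; the
names fls32() and obj_index() are made up for the illustration and are
not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct reciprocal_value. */
struct reciprocal_value {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* 1-based position of the most significant set bit; 0 for x == 0. */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Precompute the multiplier and shifts for dividing by d (d != 0). */
static struct reciprocal_value reciprocal_value(uint32_t d)
{
	struct reciprocal_value R;
	uint64_t m;
	int l;

	assert(d != 0);
	l = fls32(d - 1);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	R.m = (uint32_t)m;
	R.sh1 = (uint8_t)(l > 1 ? 1 : l);	/* min(l, 1) */
	R.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);	/* max(l - 1, 0) */
	return R;
}

/* Compute a / d with one multiplication and a few shifts. */
static uint32_t reciprocal_divide(uint32_t a, struct reciprocal_value R)
{
	uint32_t t = (uint32_t)(((uint64_t)a * R.m) >> 32);

	return (t + ((a - t) >> R.sh1)) >> R.sh2;
}

/* Hypothetical helper: index of obj within a slab starting at base. */
static unsigned int obj_index(const char *base, const char *obj,
			      struct reciprocal_value rsize)
{
	return reciprocal_divide((uint32_t)(obj - base), rsize);
}
```

The reciprocal is computed once per cache (the patch does this in
calculate_sizes()), so each lookup pays only for the multiplication.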

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/slub_def.h | 16 ++++++++++++++++
 mm/slub.c                | 15 +++++----------
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..30e91c83d401 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,18 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+/* Determine object index from a given position */
+static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
+					  void *addr, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - addr,
+				 cache->reciprocal_size);
+}
+
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return __obj_to_index(cache, page_address(page), obj);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index 354e475db5ec..6007c38071f5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -313,12 +313,6 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 		__p < (__addr) + (__objects) * (__s)->size; \
 		__p += (__s)->size)
 
-/* Determine object index from a given position */
-static inline unsigned int slab_index(void *p, struct kmem_cache *s, void *addr)
-{
-	return (kasan_reset_tag(p) - addr) / s->size;
-}
-
 static inline unsigned int order_objects(unsigned int order, unsigned int size)
 {
 	return ((unsigned int)PAGE_SIZE << order) / size;
@@ -461,7 +455,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page)
 	bitmap_zero(object_map, page->objects);
 
 	for (p = page->freelist; p; p = get_freepointer(s, p))
-		set_bit(slab_index(p, s, addr), object_map);
+		set_bit(__obj_to_index(s, addr, p), object_map);
 
 	return object_map;
 }
@@ -3675,6 +3669,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
@@ -3781,7 +3776,7 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page,
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
 
-		if (!test_bit(slab_index(p, s, addr), map)) {
+		if (!test_bit(__obj_to_index(s, addr, p), map)) {
 			pr_err("INFO: Object 0x%p @offset=%tu\n", p, p - addr);
 			print_tracking(s, p);
 		}
@@ -4506,7 +4501,7 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
 	/* Now we know that a valid freelist exists */
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
-		u8 val = test_bit(slab_index(p, s, addr), map) ?
+		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
 
 		if (!check_object(s, page, p, val))
@@ -4697,7 +4692,7 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s,
 
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
+		if (!test_bit(__obj_to_index(s, addr, p), map))
 			add_location(t, s, get_track(s, p, alloc));
 	put_map(map);
 }
-- 
2.25.4



* [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-18  0:47   ` Shakeel Butt
                     ` (2 more replies)
  2020-06-08 23:06 ` [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
                   ` (15 subsequent siblings)
  20 siblings, 3 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

From: Johannes Weiner <hannes@cmpxchg.org>

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's
new slab controller, which maintains pools of objects and doesn't want
to keep an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track
pointers to an object) is only for historical reasons: memcg used to
not take any css references and simply stalled offlining until all
charges had been reparented and the page counters had dropped to
zero. When we got rid of the reparenting requirement, the simple
mechanical translation was to take a reference for every charge.

More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"),
commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
reclaim from offlined groups").

The new slab controller exposes the limitations in this scheme, so
let's switch it to a more idiomatic reference counting model based on
actual kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg it's caching

- User pages hold a reference for their page->mem_cgroup. Transparent
  huge pages will no longer acquire tail references in advance, we'll
  get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- Pages allocated in the root cgroup will acquire and release css
  references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
  references; this change gets rid of that as well.
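
The first rule in the list above (one reference per cached pointer
rather than one per cached page) can be sketched with a toy userspace
model. The types and names below (struct group, struct stock) are
illustrative stand-ins, not the kernel's structures, and the real page
counter uncharge is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memcg: only the reference count is modelled. */
struct group {
	int refs;
};

/* Toy stand-in for the per-cpu stock. */
struct stock {
	struct group *cached;
	unsigned int nr_pages;
};

static void group_get(struct group *g)
{
	g->refs++;
}

static void group_put(struct group *g)
{
	assert(g->refs > 0);
	g->refs--;
}

/* Drop the cached pages and the single reference held for the pointer. */
static void drain_stock(struct stock *s)
{
	struct group *old = s->cached;

	if (!old)
		return;

	s->nr_pages = 0;	/* the kernel uncharges the page counters here */
	group_put(old);
	s->cached = NULL;
}

/* Cache nr_pages for g; a reference is taken only when the pointer changes. */
static void refill_stock(struct stock *s, struct group *g,
			 unsigned int nr_pages)
{
	if (s->cached != g) {
		drain_stock(s);
		group_get(g);
		s->cached = g;
	}
	s->nr_pages += nr_pages;
}
```

In this model the reference count of a group no longer depends on how
many pages sit in the stock, only on whether the stock points at it.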

Roman:
1) Rebased on top of the current mm tree: added css_get() in
   mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
2) I've reformatted commit references in the commit log to make
   checkpatch.pl happy.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 37 +++++++++++++++++++++----------------
 mm/slab.h       |  2 --
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d18bf93e0f19..80282b2e8b7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2094,13 +2094,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
+	if (!old)
+		return;
+
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
-		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
+
+	css_put(&old->css);
 	stock->cached = NULL;
 }
 
@@ -2136,6 +2140,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
+		css_get(&memcg->css);
 		stock->cached = memcg;
 	}
 	stock->nr_pages += nr_pages;
@@ -2594,12 +2599,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2657,8 +2660,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif
 
@@ -2964,6 +2965,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+			return 0;
 		}
 	}
 	css_put(&memcg->css);
@@ -2986,12 +2988,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
+	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = head->mem_cgroup;
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		css_get(&memcg->css);
+		head[i].mem_cgroup = memcg;
+	}
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -5454,7 +5458,10 @@ static int mem_cgroup_move_account(struct page *page,
 	 */
 	smp_mb();
 
-	page->mem_cgroup = to; 	/* caller should have done css_get */
+	css_get(&to->css);
+	css_put(&from->css);
+
+	page->mem_cgroup = to;
 
 	__unlock_page_memcg(from);
 
@@ -6540,6 +6547,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 	if (ret)
 		goto out_put;
 
+	css_get(&memcg->css);
 	commit_charge(page, memcg);
 
 	local_irq_disable();
@@ -6594,9 +6602,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
-
-	if (!mem_cgroup_is_root(ug->memcg))
-		css_put_many(&ug->memcg->css, ug->nr_pages);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6634,6 +6639,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->dummy_page = page;
 	page->mem_cgroup = NULL;
+	css_put(&ug->memcg->css);
 }
 
 static void uncharge_list(struct list_head *page_list)
@@ -6739,8 +6745,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
+	css_get(&memcg->css);
 	commit_charge(newpage, memcg);
 
 	local_irq_save(flags);
@@ -6977,8 +6983,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	mem_cgroup_charge_statistics(memcg, page, -nr_entries);
 	memcg_check_events(memcg, page);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_entries);
+	css_put(&memcg->css);
 }
 
 /**
diff --git a/mm/slab.h b/mm/slab.h
index 633eedb6bad1..8a574d9361c1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -373,9 +373,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
-	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-	css_put_many(&memcg->css, nr_pages);
 out:
 	css_put(&memcg->css);
 	return ret;
-- 
2.25.4



* [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (4 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-19 15:42   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

The obj_cgroup API provides the ability to account sub-page sized
kernel objects, which can potentially outlive the original memory cgroup.

The top-level API consists of the following functions:
  bool obj_cgroup_tryget(struct obj_cgroup *objcg);
  void obj_cgroup_get(struct obj_cgroup *objcg);
  void obj_cgroup_put(struct obj_cgroup *objcg);

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

  struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
  struct obj_cgroup *get_obj_cgroup_from_current(void);

Object cgroup is basically a pointer to a memory cgroup with a per-cpu
reference counter. It substitutes a memory cgroup in places where
it's necessary to charge a custom amount of bytes instead of pages.

Charges reach the corresponding memory cgroup in whole pages via
__memcg_kmem_charge(): a request that misses the per-cpu stock is
rounded up to pages, and the unused remainder is kept in the stock.

It implements reparenting: on memcg offlining the object cgroup is
reattached to the parent memory cgroup. Each online memory cgroup has
an associated active object cgroup to handle new allocations and a list
of all attached object cgroups. On offlining of a cgroup this list is
reparented, and for each object cgroup in the list the memcg pointer is
swapped to the parent memory cgroup. This prevents long-living objects
from pinning the original memory cgroup in memory.
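
The effect of reparenting on live objects can be sketched with a toy
userspace model. It assumes a single object cgroup and moves one pinning
reference from the child to the parent; the exact reference accounting
in the patch differs, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy memory cgroup: a parent link and a count of pinning references. */
struct toy_memcg {
	struct toy_memcg *parent;
	int refs;
};

/* Toy object cgroup: the indirection that objects point at. */
struct toy_objcg {
	struct toy_memcg *memcg;
};

/* A long-living kernel object remembers its objcg, never the memcg. */
struct toy_object {
	struct toy_objcg *objcg;
};

/* Swap the objcg to the parent and move its pinning reference along. */
static void toy_reparent(struct toy_objcg *objcg)
{
	struct toy_memcg *old = objcg->memcg;
	struct toy_memcg *parent = old->parent;

	assert(parent != NULL);
	parent->refs++;
	objcg->memcg = parent;
	old->refs--;
}
```

Because objects only hold the objcg pointer, none of them has to be
visited when the child cgroup goes away.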

The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is a part of the
obj_cgroup object. So on cgroup offlining the leftover is automatically
reparented.
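
The byte-sized stock can be sketched as a single-CPU userspace model.
This is a simplification under stated assumptions: the per-cpu stock and
the atomic leftover are folded into one field, whole pages are returned
as soon as they accumulate, and the toy_* names and TOY_PAGE_SIZE are
made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* Toy object cgroup: pages charged upstream plus pre-charged bytes. */
struct toy_objcg {
	unsigned int charged_pages;	/* pages charged to the memcg */
	unsigned int stock_bytes;	/* charged but not yet handed out */
};

/* Charge size bytes: use the stock if possible, else charge new pages. */
static void toy_charge(struct toy_objcg *o, size_t size)
{
	unsigned int nr_pages, nr_bytes;

	if (o->stock_bytes >= size) {
		o->stock_bytes -= (unsigned int)size;
		return;
	}

	nr_pages = (unsigned int)(size / TOY_PAGE_SIZE);
	nr_bytes = (unsigned int)(size % TOY_PAGE_SIZE);
	if (nr_bytes)
		nr_pages++;		/* round the request up to pages */

	o->charged_pages += nr_pages;
	if (nr_bytes)
		o->stock_bytes += TOY_PAGE_SIZE - nr_bytes;
}

/* Uncharge size bytes: refill the stock and give back whole pages. */
static void toy_uncharge(struct toy_objcg *o, size_t size)
{
	o->stock_bytes += (unsigned int)size;
	o->charged_pages -= o->stock_bytes / TOY_PAGE_SIZE;
	o->stock_bytes %= TOY_PAGE_SIZE;
}
```

With a 4096-byte page, charging 92 bytes charges one page and leaves
4004 bytes in the stock, so later small allocations cost no new page
charge until the stock runs out.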

memcg->objcg is rcu protected.
objcg->memcg is a raw pointer, which is always pointing at a memory
cgroup, but can be atomically swapped to the parent memory cgroup. So
the caller must ensure the lifetime of the cgroup, e.g. grab
rcu_read_lock or css_set_lock.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  51 +++++++
 mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 338 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 93dbc7f9d8b8..c69e66fe4f12 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
+ */
+struct obj_cgroup {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	atomic_t nr_charged_bytes;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -301,6 +318,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct obj_cgroup __rcu *objcg;
+	struct list_head objcg_list;
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -416,6 +435,33 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
+{
+	return percpu_ref_tryget(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	percpu_ref_get(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+	percpu_ref_put(&objcg->refcnt);
+}
+
+/*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1368,6 +1414,11 @@ void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_current(void);
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 80282b2e8b7f..7ff66275966c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,98 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+	struct mem_cgroup *memcg;
+	unsigned int nr_bytes;
+	unsigned int nr_pages;
+	unsigned long flags;
+
+	/*
+	 * At this point all allocated objects are freed, and
+	 * objcg->nr_charged_bytes can't have an arbitrary byte value.
+	 * However, it can be PAGE_SIZE or (x * PAGE_SIZE).
+	 *
+	 * The following sequence can lead to it:
+	 * 1) CPU0: objcg == stock->cached_objcg
+	 * 2) CPU1: we do a small allocation (e.g. 92 bytes),
+	 *          PAGE_SIZE bytes are charged
+	 * 3) CPU1: a process from another memcg is allocating something,
+	 *          the stock is flushed,
+	 *          objcg->nr_charged_bytes = PAGE_SIZE - 92
+	 * 4) CPU0: we do release this object,
+	 *          92 bytes are added to stock->nr_bytes
+	 * 5) CPU0: stock is flushed,
+	 *          92 bytes are added to objcg->nr_charged_bytes
+	 *
+	 * In the result, nr_charged_bytes == PAGE_SIZE.
+	 * This page will be uncharged in obj_cgroup_release().
+	 */
+	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
+	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	memcg = obj_cgroup_memcg(objcg);
+	if (nr_pages)
+		__memcg_kmem_uncharge(memcg, nr_pages);
+	list_del(&objcg->list);
+	mem_cgroup_put(memcg);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	percpu_ref_exit(ref);
+	kfree_rcu(objcg, rcu);
+}
+
+static struct obj_cgroup *obj_cgroup_alloc(void)
+{
+	struct obj_cgroup *objcg;
+	int ret;
+
+	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
+		return NULL;
+
+	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
+			      GFP_KERNEL);
+	if (ret) {
+		kfree(objcg);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&objcg->list);
+	return objcg;
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct obj_cgroup *objcg, *iter;
+
+	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+
+	spin_lock_irq(&css_set_lock);
+
+	/* Move active objcg to the parent's list */
+	xchg(&objcg->memcg, parent);
+	css_get(&parent->css);
+	list_add(&objcg->list, &parent->objcg_list);
+
+	/* Move already reparented objcgs to the parent's list */
+	list_for_each_entry(iter, &memcg->objcg_list, list) {
+		css_get(&parent->css);
+		xchg(&iter->memcg, parent);
+		css_put(&memcg->css);
+	}
+	list_splice(&memcg->objcg_list, &parent->objcg_list);
+
+	spin_unlock_irq(&css_set_lock);
+
+	percpu_ref_kill(&objcg->refcnt);
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -2047,6 +2139,12 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	unsigned int nr_bytes;
+#endif
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2054,6 +2152,22 @@ struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg);
+
+#else
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+}
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	return false;
+}
+#endif
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2120,6 +2234,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2179,6 +2294,8 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		if (obj_stock_flush_required(stock, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2705,6 +2822,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	return page->mem_cgroup;
 }
 
+__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (unlikely(!current->mm))
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(current->active_memcg))
+		memcg = rcu_dereference(current->active_memcg);
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -2994,6 +3135,140 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 }
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->cached_objcg;
+
+	if (!old)
+		return;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		if (nr_pages) {
+			rcu_read_lock();
+			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			rcu_read_unlock();
+		}
+
+		/*
+		 * The leftover is flushed to the centralized per-memcg value.
+		 * On the next attempt to refill obj stock it will be moved
+		 * to a per-cpu stock (probably, on another CPU), see
+		 * refill_obj_stock().
+		 *
+		 * How often it's flushed is a trade-off between the memory
+		 * limit enforcement accuracy and potential CPU contention,
+		 * so it might be changed in the future.
+		 */
+		atomic_add(nr_bytes, &old->nr_charged_bytes);
+		stock->nr_bytes = 0;
+	}
+
+	obj_cgroup_put(old);
+	stock->cached_objcg = NULL;
+}
+
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+
+	return false;
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached_objcg != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		obj_cgroup_get(objcg);
+		stock->cached_objcg = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > PAGE_SIZE)
+		drain_obj_stock(stock);
+
+	local_irq_restore(flags);
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	unsigned int nr_pages, nr_bytes;
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	/*
+	 * In theory, objcg->nr_charged_bytes can have enough
+	 * pre-charged bytes to satisfy the allocation. However,
+	 * flushing objcg->nr_charged_bytes requires two atomic
+	 * operations, and objcg->nr_charged_bytes can't be big,
+	 * so it's better to ignore it and try to grab some new pages.
+	 * objcg->nr_charged_bytes will be flushed in
+	 * refill_obj_stock(), called from this function or
+	 * independently later.
+	 */
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	css_get(&memcg->css);
+	rcu_read_unlock();
+
+	nr_pages = size >> PAGE_SHIFT;
+	nr_bytes = size & (PAGE_SIZE - 1);
+
+	if (nr_bytes)
+		nr_pages += 1;
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	if (!ret && nr_bytes)
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+
+	css_put(&memcg->css);
+	return ret;
+}
+
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
+{
+	refill_obj_stock(objcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -3414,6 +3689,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
+	struct obj_cgroup *objcg;
 	int memcg_id;
 
 	if (cgroup_memory_nokmem)
@@ -3426,6 +3702,14 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg) {
+		memcg_free_cache_id(memcg_id);
+		return -ENOMEM;
+	}
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3461,9 +3745,10 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches.
+	 * Deactivate and reparent kmem_caches and objcgs.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5032,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->objcg_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.25.4



* [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (5 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-19 16:36   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Allocate and release memory to store obj_cgroup pointers for each
non-root slab page. Reuse the page->mem_cgroup pointer to store a pointer
to the allocated space.

To distinguish between obj_cgroups and memcg pointers in cases
where it's not obvious which one is used (as in page_cgroup_ino()),
always set the lowest bit in the obj_cgroups case.
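
For illustration only, the tagging scheme can be sketched in userspace C.
This is not the kernel code: struct obj_cgroup_sketch, OBJCGS_TAG and all
four helper names are hypothetical, and calloc()/free() stand in for
kcalloc()/kfree(). The sketch relies on the allocator returning
pointer-aligned memory, so bit 0 of the address is always free to carry
the tag.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct obj_cgroup. */
struct obj_cgroup_sketch { int unused; };

/* Bit 0 marks "this word holds an obj_cgroups vector, not a memcg". */
#define OBJCGS_TAG ((uintptr_t)0x1)

/* Allocate a zeroed vector with one slot per object and tag its address. */
static uintptr_t alloc_tagged_vec(size_t objects)
{
	void *vec = calloc(objects, sizeof(struct obj_cgroup_sketch *));

	if (!vec)
		return 0;
	/* calloc() memory is aligned for pointers, so bit 0 is clear. */
	assert(((uintptr_t)vec & OBJCGS_TAG) == 0);
	return (uintptr_t)vec | OBJCGS_TAG;
}

/* A reader that cannot tell the two cases apart checks the tag first. */
static int is_objcg_vec(uintptr_t word)
{
	return (word & OBJCGS_TAG) != 0;
}

/* Strip the tag before using the word as a vector pointer. */
static struct obj_cgroup_sketch **untag_vec(uintptr_t word)
{
	return (struct obj_cgroup_sketch **)(word & ~OBJCGS_TAG);
}

static void free_tagged_vec(uintptr_t word)
{
	free(untag_vec(word));
}
```

A caller that only knows it has "some pointer from the page" would test
is_objcg_vec() and treat a tagged word as "no single owning memcg", which
mirrors what page_cgroup_ino() does in the diff below.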

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm_types.h |  5 +++-
 include/linux/slab_def.h |  6 +++++
 include/linux/slub_def.h |  5 ++++
 mm/memcontrol.c          | 17 +++++++++++---
 mm/slab.h                | 49 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64ede5f150dc..0277fbab7c93 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct obj_cgroup **obj_cgroups;
+	};
 #endif
 
 	/*
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index abc7de77b988..ccda7b9669a5 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 	return reciprocal_divide(offset, cache->reciprocal_buffer_size);
 }
 
+static inline int objs_per_slab_page(const struct kmem_cache *cache,
+				     const struct page *page)
+{
+	return cache->num;
+}
+
 #endif	/* _LINUX_SLAB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 30e91c83d401..f87302dcfe8c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
 	return __obj_to_index(cache, page_address(page), obj);
 }
 
+static inline int objs_per_slab_page(const struct kmem_cache *cache,
+				     const struct page *page)
+{
+	return page->objects;
+}
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ff66275966c..2020c7542aa1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page))
+	if (PageSlab(page) && !PageTail(page)) {
 		memcg = memcg_from_slab_page(page);
-	else
-		memcg = READ_ONCE(page->mem_cgroup);
+	} else {
+		memcg = page->mem_cgroup;
+
+		/*
+		 * The lowest bit set means that memcg isn't a valid
+		 * memcg pointer, but an obj_cgroups pointer.
+		 * In this case the page is shared and doesn't belong
+		 * to any specific memory cgroup.
+		 */
+		if ((unsigned long) memcg & 0x1UL)
+			memcg = NULL;
+	}
+
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
diff --git a/mm/slab.h b/mm/slab.h
index 8a574d9361c1..a1633ea15fbf 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	/*
+	 * page->mem_cgroup and page->obj_cgroups are sharing the same
+	 * space. To distinguish between them in case we don't know for sure
+	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
+	 * always set the lowest bit of obj_cgroups.
+	 */
+	return (struct obj_cgroup **)
+		((unsigned long)page->obj_cgroups & ~0x1UL);
+}
+
 /*
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
@@ -406,6 +418,26 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page,
+					       struct kmem_cache *s, gfp_t gfp)
+{
+	unsigned int objects = objs_per_slab_page(s, page);
+	void *vec;
+
+	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
+	if (!vec)
+		return -ENOMEM;
+
+	page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+	kfree(page_obj_cgroups(page));
+	page->obj_cgroups = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +487,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page,
+					       struct kmem_cache *s, gfp_t gfp)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -481,12 +523,18 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +547,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_obj_cgroups(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
-- 
2.25.4



* [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (6 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-20  0:16   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Store the obj_cgroup pointer in the corresponding place of
page->obj_cgroups for each allocated non-root slab object.
Make sure that each allocated object holds a reference to obj_cgroup.

The objcg pointer is obtained by dereferencing memcg->objcg
in memcg_kmem_get_cache() and is passed from the pre_alloc hook to
the post_alloc hook. Then, if the allocation(s) succeeded, it is
stored in the page->obj_cgroups vector.
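
As a rough userspace illustration of the bookkeeping described above
(not the kernel implementation), the following sketch keeps one owner
slot per object, takes a reference when an object is handed out and
drops it when the object is freed. All names and the fixed
4-objects-of-64-bytes layout are assumptions made for the example; the
plain long counter stands in for the real reference counting, and the
NULL check in drop_owner() is an addition of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct obj_cgroup: only a reference count. */
struct objcg_sketch { long refcnt; };

#define SKETCH_OBJS     4
#define SKETCH_OBJ_SIZE 64

/* A toy "slab page": object storage plus one owner slot per object. */
struct slab_page_sketch {
	char objs[SKETCH_OBJS][SKETCH_OBJ_SIZE];
	struct objcg_sketch *obj_cgroups[SKETCH_OBJS];
};

/* Map an object address to its slot, in the spirit of obj_to_index(). */
static size_t obj_index(const struct slab_page_sketch *page, const void *obj)
{
	return (size_t)((const char *)obj - &page->objs[0][0]) / SKETCH_OBJ_SIZE;
}

/* Allocation side: the object pins its owner and records it. */
static void record_owner(struct slab_page_sketch *page, void *obj,
			 struct objcg_sketch *objcg)
{
	objcg->refcnt++;
	page->obj_cgroups[obj_index(page, obj)] = objcg;
}

/* Free side: clear the slot and drop the reference the object held. */
static void drop_owner(struct slab_page_sketch *page, void *obj)
{
	size_t off = obj_index(page, obj);
	struct objcg_sketch *objcg = page->obj_cgroups[off];

	page->obj_cgroups[off] = NULL;
	if (objcg)
		objcg->refcnt--;
}
```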

The code that obtains the objcg looks a bit bulky now, but it will be
simplified by later commits in the series.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            | 14 +++++++--
 mm/slab.c                  | 18 +++++++-----
 mm/slab.h                  | 60 ++++++++++++++++++++++++++++++++++----
 mm/slub.c                  | 14 +++++----
 5 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c69e66fe4f12..c63473fffdda 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1404,7 +1404,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2020c7542aa1..f0ea0ce6bea5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2971,7 +2971,8 @@ static inline bool memcg_kmem_bypass(void)
  * done with it, memcg_kmem_put_cache() must be called to release the
  * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -3027,8 +3028,17 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 */
 	if (unlikely(!memcg_cachep))
 		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
+		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
+
+		if (!objcg || !obj_cgroup_tryget(objcg)) {
+			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
+			goto out_unlock;
+		}
+
+		*objcgp = objcg;
 		cachep = memcg_cachep;
+	}
 out_unlock:
 	rcu_read_unlock();
 	return cachep;
diff --git a/mm/slab.c b/mm/slab.c
index 9350062ffc1a..02b4363930c1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3222,9 +3222,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3260,7 +3261,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3301,9 +3302,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3317,7 +3319,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3439,6 +3441,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
+	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
@@ -3504,8 +3507,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3528,13 +3532,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index a1633ea15fbf..8bca0cb4b928 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -438,6 +438,41 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+	struct page *page;
+	unsigned long off;
+	size_t i;
+
+	for (i = 0; i < size; i++) {
+		if (likely(p[i])) {
+			page = virt_to_head_page(p[i]);
+			off = obj_to_index(s, page, p[i]);
+			obj_cgroup_get(objcg);
+			page_obj_cgroups(page)[off] = objcg;
+		}
+	}
+	obj_cgroup_put(objcg);
+	memcg_kmem_put_cache(s);
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+	struct obj_cgroup *objcg;
+	unsigned int off;
+
+	if (!memcg_kmem_enabled() || is_root_cache(s))
+		return;
+
+	off = obj_to_index(s, page, p);
+	objcg = page_obj_cgroups(page)[off];
+	page_obj_cgroups(page)[off] = NULL;
+	obj_cgroup_put(objcg);
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -497,6 +532,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -605,7 +651,8 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     gfp_t flags)
+						     struct obj_cgroup **objcgp,
+						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
 
@@ -619,13 +666,14 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s);
+		return memcg_kmem_get_cache(s, objcgp);
 
 	return s;
 }
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-					size_t size, void **p)
+static inline void slab_post_alloc_hook(struct kmem_cache *s,
+					struct obj_cgroup *objcg,
+					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
 
@@ -637,8 +685,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_kmem_put_cache(s);
+	if (memcg_kmem_enabled() && !is_root_cache(s))
+		memcg_slab_post_alloc_hook(s, objcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index 6007c38071f5..7007eceac4c4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2738,8 +2738,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
+	s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
 	if (!s)
 		return NULL;
 redo:
@@ -2815,7 +2816,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
-	slab_post_alloc_hook(s, gfpflags, 1, &object);
+	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object);
 
 	return object;
 }
@@ -3020,6 +3021,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
+
+	memcg_slab_free_hook(s, page, head);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3199,9 +3202,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
+	struct obj_cgroup *objcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3255,11 +3259,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.25.4



* [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (7 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-20  0:54   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Switch to per-object accounting of non-root slab objects.

Charging is performed using the obj_cgroup API in the pre_alloc hook.
The obj_cgroup is charged with the size of the object plus the size
of the metadata, which for now is the size of an obj_cgroup pointer.
If the memory has been charged successfully, the actual allocation
code is executed. Otherwise, -ENOMEM is returned.
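
To make the arithmetic concrete, here is a small userspace sketch of how
a byte-sized charge maps onto whole pages, following the rounding done in
obj_cgroup_charge() from the obj_cgroup API patch. It is an illustration
under assumptions: a 4 KiB page size is hard-coded, the names full_size(),
plan_charge() and struct charge_plan are made up for the example, and the
step that first tries to satisfy the charge from the per-cpu stock is left
out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page geometry for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Per-object charge: the object itself plus one owner pointer. */
static size_t full_size(size_t object_size)
{
	return object_size + sizeof(void *);
}

/* Result of splitting a byte charge into pages and leftover bytes. */
struct charge_plan {
	size_t nr_pages;    /* whole pages to charge to the memcg */
	size_t stock_bytes; /* unused tail of the last page, kept as stock */
};

static struct charge_plan plan_charge(size_t size)
{
	struct charge_plan plan;
	size_t rem = size & (SKETCH_PAGE_SIZE - 1);

	plan.nr_pages = size >> SKETCH_PAGE_SHIFT;
	plan.stock_bytes = 0;
	if (rem) {
		/* Round up to a full page; the remainder is pre-charged. */
		plan.nr_pages += 1;
		plan.stock_bytes = SKETCH_PAGE_SIZE - rem;
	}
	return plan;
}
```

For example, with these assumptions a 100-byte charge takes one page and
leaves 3996 pre-charged bytes for later allocations from the same cgroup.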

In the post_alloc hook, if the actual allocation succeeded, the
corresponding vmstats are bumped and the obj_cgroup pointer is saved.
Otherwise, the charge is canceled.

On the free path, the obj_cgroup pointer is obtained and used to uncharge
the size of the object being released.

Memcg and lruvec counters now represent only the memory used
by active slab objects and do not include the free space. The free
space is shared and doesn't belong to any specific cgroup.

Global per-node slab vmstats are still modified from the
(un)charge_slab_page() functions. The idea is to keep all slab pages
accounted as slab pages at the system level.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h | 173 ++++++++++++++++++++++++------------------------------
 1 file changed, 77 insertions(+), 96 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 8bca0cb4b928..f219a29052d9 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -352,72 +352,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-/*
- * Charge the slab page belonging to the non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline int memcg_charge_slab(struct page *page,
-					     gfp_t gfp, int order,
-					     struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-	int ret;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	while (memcg && !css_tryget_online(&memcg->css))
-		memcg = parent_mem_cgroup(memcg);
-	rcu_read_unlock();
-
-	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages << PAGE_SHIFT);
-		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-		return 0;
-	}
-
-	ret = memcg_kmem_charge(memcg, gfp, nr_pages);
-	if (ret)
-		goto out;
-
-	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
-
-	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-out:
-	css_put(&memcg->css);
-	return ret;
-}
-
-/*
- * Uncharge a slab page belonging to a non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline void memcg_uncharge_slab(struct page *page, int order,
-						struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	if (likely(!mem_cgroup_is_root(memcg))) {
-		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
-				 -(nr_pages << PAGE_SHIFT));
-		memcg_kmem_uncharge(memcg, nr_pages);
-	} else {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(nr_pages << PAGE_SHIFT));
-	}
-	rcu_read_unlock();
-
-	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page,
 					       struct kmem_cache *s, gfp_t gfp)
 {
@@ -438,6 +372,47 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline size_t obj_full_size(struct kmem_cache *s)
+{
+	/*
+	 * For each accounted object there is an extra space which is used
+	 * to store obj_cgroup membership. Charge it too.
+	 */
+	return s->size + sizeof(struct obj_cgroup *);
+}
+
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	struct kmem_cache *cachep;
+
+	cachep = memcg_kmem_get_cache(s, objcgp);
+	if (is_root_cache(cachep))
+		return s;
+
+	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
+		memcg_kmem_put_cache(cachep);
+		cachep = NULL;
+	}
+
+	return cachep;
+}
+
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   int idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -452,6 +427,10 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
+			mod_objcg_state(objcg, page_pgdat(page),
+					cache_vmstat_idx(s), obj_full_size(s));
+		} else {
+			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
 	}
 	obj_cgroup_put(objcg);
@@ -470,6 +449,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
+
+	obj_cgroup_uncharge(objcg, obj_full_size(s));
+	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+			-obj_full_size(s));
+
 	obj_cgroup_put(objcg);
 }
 
@@ -511,17 +495,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
-				    struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void memcg_uncharge_slab(struct page *page, int order,
-				       struct kmem_cache *s)
-{
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page,
 					       struct kmem_cache *s, gfp_t gfp)
 {
@@ -532,6 +505,13 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	return NULL;
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -569,32 +549,33 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-	int ret;
-
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    PAGE_SIZE << order);
-		return 0;
-	}
+#ifdef CONFIG_MEMCG_KMEM
+	if (memcg_kmem_enabled() && !is_root_cache(s)) {
+		int ret;
 
-	ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
-	if (ret)
-		return ret;
+		ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
+		if (ret)
+			return ret;
 
-	return memcg_charge_slab(page, gfp, order, s);
+		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	}
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    PAGE_SIZE << order);
+	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(PAGE_SIZE << order));
-		return;
+#ifdef CONFIG_MEMCG_KMEM
+	if (memcg_kmem_enabled() && !is_root_cache(s)) {
+		memcg_free_page_obj_cgroups(page);
+		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 	}
-
-	memcg_free_page_obj_cgroups(page);
-	memcg_uncharge_slab(page, order, s);
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    -(PAGE_SIZE << order));
 }
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
@@ -666,7 +647,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s, objcgp);
+		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
 
 	return s;
 }
-- 
2.25.4



* [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (8 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 17:12   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Deprecate memory.kmem.slabinfo.

An empty file will be presented if the corresponding config options are
enabled.

The interface is implementation dependent, isn't present in cgroup v2,
and is generally useful only for core mm debugging purposes. In other
words, it doesn't provide any value for the vast majority of users.

A drgn-based replacement can be found in tools/cgroup/slabinfo.py.
It supports cgroup v1 and v2, mimics the memory.kmem.slabinfo output,
and also makes it possible to get additional information without
recompiling the kernel.

If a drgn-based solution is too slow for a task, a bpf-based tracing
tool can be used instead; such a tool can easily keep track of all slab
allocations belonging to a memory cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memcontrol.c  |  3 ---
 mm/slab_common.c | 31 ++++---------------------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f0ea0ce6bea5..004a31941a88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5116,9 +5116,6 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	(defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = memcg_slab_start,
-		.seq_next = memcg_slab_next,
-		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b578ae29c743..3c89c2adc930 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1523,35 +1523,12 @@ void dump_unreclaimable_slab(void)
 }
 
 #if defined(CONFIG_MEMCG_KMEM)
-void *memcg_slab_start(struct seq_file *m, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	mutex_lock(&slab_mutex);
-	return seq_list_start(&memcg->kmem_caches, *pos);
-}
-
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	return seq_list_next(p, &memcg->kmem_caches, pos);
-}
-
-void memcg_slab_stop(struct seq_file *m, void *p)
-{
-	mutex_unlock(&slab_mutex);
-}
-
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache,
-					  memcg_params.kmem_caches_node);
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	if (p == memcg->kmem_caches.next)
-		print_slabinfo_header(m);
-	cache_show(s, m);
+	/*
+	 * Deprecated.
+	 * Please, take a look at tools/cgroup/slabinfo.py .
+	 */
 	return 0;
 }
 #endif
-- 
2.25.4



* [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (9 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-20  1:19   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

To make memcg_kmem_bypass() available outside of memcontrol.c,
move it to memcontrol.h. The function is small and is a good fit
for a static inline.

It will be used from the slab code.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
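
For readers who want to poke at the bypass rule outside the kernel, here
is a minimal userspace sketch of the same decision structure. It is not
kernel code: struct model_task, MODEL_PF_KTHREAD and the model_* functions
are hypothetical stand-ins for in_interrupt(), current->mm, current->flags
and current->active_memcg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PF_KTHREAD; the value is illustrative only. */
#define MODEL_PF_KTHREAD 0x1u

/* Hypothetical snapshot of the state memcg_kmem_bypass() looks at. */
struct model_task {
	bool in_interrupt;        /* models in_interrupt() */
	bool has_mm;              /* models current->mm != NULL */
	unsigned int flags;       /* models current->flags */
	const void *active_memcg; /* models current->active_memcg */
};

/* Same decision structure as memcg_kmem_bypass() in this patch. */
static bool model_kmem_bypass(const struct model_task *t)
{
	if (t->in_interrupt)
		return true;

	/* Kernel threads are accounted only if a remote memcg is set. */
	if ((!t->has_mm || (t->flags & MODEL_PF_KTHREAD)) && !t->active_memcg)
		return true;

	return false;
}

/* Walks through the cases the predicate distinguishes. */
static void model_kmem_bypass_selfcheck(void)
{
	int memcg_token = 0; /* any non-NULL pointer models a set active_memcg */
	struct model_task user = { .has_mm = true };
	struct model_task irq = { .in_interrupt = true, .has_mm = true };
	struct model_task kthread = { .flags = MODEL_PF_KTHREAD };
	struct model_task remote = { .flags = MODEL_PF_KTHREAD,
				     .active_memcg = &memcg_token };

	assert(!model_kmem_bypass(&user));   /* ordinary task: account */
	assert(model_kmem_bypass(&irq));     /* interrupt context: bypass */
	assert(model_kmem_bypass(&kthread)); /* plain kthread: bypass */
	assert(!model_kmem_bypass(&remote)); /* remote charging: account */
}
```

The only non-obvious case is the last one: a kernel thread that has set
an active memcg is accounted, which is what the "remote memcg charging"
comment in the function refers to.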
 include/linux/memcontrol.h | 12 ++++++++++++
 mm/memcontrol.c            | 12 ------------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c63473fffdda..ba7065c0922a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1440,6 +1440,18 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt())
+		return true;
+
+	/* Allow remote memcg charging in kthread contexts. */
+	if ((!current->mm || (current->flags & PF_KTHREAD)) &&
+	     !current->active_memcg)
+		return true;
+	return false;
+}
+
 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
 					 int order)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 004a31941a88..51e85d05095c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2943,18 +2943,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	queue_work(memcg_kmem_cache_wq, &cw->work);
 }
 
-static inline bool memcg_kmem_bypass(void)
-{
-	if (in_interrupt())
-		return true;
-
-	/* Allow remote memcg charging in kthread contexts. */
-	if ((!current->mm || (current->flags & PF_KTHREAD)) &&
-	     !current->active_memcg)
-		return true;
-	return false;
-}
-
 /**
  * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
-- 
2.25.4



* [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (10 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 16:56   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

This is a fairly big but mostly red patch (far more deletions than
insertions), which makes all accounted slab allocations use a single
set of kmem_caches instead of creating a separate set for each memory
cgroup.

Because the number of non-root kmem_caches is now capped by the number
of root kmem_caches, there is no need to shrink or destroy them
prematurely. They can simply be destroyed together with their
root counterparts. This makes it possible to dramatically simplify the
management of non-root kmem_caches and delete a ton of code.

This patch performs the following changes:
1) introduces the memcg_params.memcg_cache pointer to represent the
   kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
   to create the memcg kmem_cache on the first allocation attempt
3) names memcg kmem_caches <kmemcache_name>-memcg,
   e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
   or schedule its creation and return the root cache
5) removes almost all non-root kmem_cache management code
   (separate refcounter, reparenting, shrinking, etc)
6) makes slab debugfs display the root_mem_cgroup css id and never
   show the :dead and :deact flags in the memcg_slabinfo attribute.

The following patches in the series will simplify kmem_cache creation.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
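
As a rough illustration of items 1) to 4) above, here is a small
single-threaded userspace sketch of the lookup-or-schedule pattern.
It is only a model under stated assumptions: struct model_cache and the
model_* helpers are hypothetical, the workqueue is reduced to a flag, and
the slab_mutex, READ_ONCE() and smp_wmb() used by the real code are left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct kmem_cache. */
struct model_cache {
	char name[64];
	struct model_cache *root_cache;  /* NULL for a root cache */
	struct model_cache *memcg_cache; /* root only: shared memcg cache */
	bool create_scheduled;           /* models a queued creation work item */
};

static bool model_is_root(const struct model_cache *c)
{
	return !c->root_cache;
}

/*
 * Models the simplified memcg_kmem_get_cache(): return the memcg cache if
 * it exists, otherwise schedule its creation and fall back to the root.
 */
static struct model_cache *model_get_cache(struct model_cache *root)
{
	if (!root->memcg_cache) {
		root->create_scheduled = true;
		return root;
	}
	return root->memcg_cache;
}

/*
 * Models the deferred creation: a no-op if the cache already exists,
 * otherwise set up the caller-provided child as "<root name>-memcg".
 */
static void model_create_memcg_cache(struct model_cache *root,
				     struct model_cache *child)
{
	if (root->memcg_cache)
		return;

	snprintf(child->name, sizeof(child->name), "%s-memcg", root->name);
	child->root_cache = root;
	child->memcg_cache = NULL;
	child->create_scheduled = false;

	root->memcg_cache = child;
	root->create_scheduled = false;
}

/* Walks through first allocation, deferred creation, later allocation. */
static void model_cache_selfcheck(void)
{
	struct model_cache root = { .name = "dentry" };
	struct model_cache child;

	/* First accounted allocation: served by the root cache. */
	assert(model_get_cache(&root) == &root);
	assert(root.create_scheduled);

	/* The deferred work runs and publishes the memcg cache. */
	model_create_memcg_cache(&root, &child);
	assert(strcmp(child.name, "dentry-memcg") == 0);
	assert(!model_is_root(&child));

	/* Later accounted allocations use the shared memcg cache. */
	assert(model_get_cache(&root) == &child);
}
```

The point of the sketch is that there is at most one memcg cache per
root cache, so the lookup no longer needs a per-memcg array, an RCU
section or a reference count on the returned cache.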
 include/linux/memcontrol.h |   5 +-
 include/linux/slab.h       |   5 +-
 mm/memcontrol.c            | 163 +++-----------
 mm/slab.c                  |  16 +-
 mm/slab.h                  | 145 ++++---------
 mm/slab_common.c           | 426 ++++---------------------------------
 mm/slub.c                  |  38 +---
 7 files changed, 128 insertions(+), 670 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ba7065c0922a..e2c4d54aa1f6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -317,7 +317,6 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct list_head kmem_caches;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list;
 #endif
@@ -1404,9 +1403,7 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp);
-void memcg_kmem_put_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6d454886bcaf..310768bfa8d2 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
+void memcg_create_kmem_cache(struct kmem_cache *cachep);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -578,8 +577,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 	return __kmalloc_node(size, flags, node);
 }
 
-int memcg_update_all_caches(int num_memcgs);
-
 /**
  * kmalloc_array - allocate memory for an array.
  * @n: number of elements.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 51e85d05095c..995204f65217 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -350,7 +350,7 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 }
 
 /*
- * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
+ * This will be used as a shrinker list's index.
  * The main reason for not using cgroup id for this:
  *  this works better in sparse environments, where we have a lot of memcgs,
  *  but only a few kmem-limited. Or also, if we have, for instance, 200
@@ -569,20 +569,16 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page)) {
-		memcg = memcg_from_slab_page(page);
-	} else {
-		memcg = page->mem_cgroup;
+	memcg = page->mem_cgroup;
 
-		/*
-		 * The lowest bit set means that memcg isn't a valid
-		 * memcg pointer, but a obj_cgroups pointer.
-		 * In this case the page is shared and doesn't belong
-		 * to any specific memory cgroup.
-		 */
-		if ((unsigned long) memcg & 0x1UL)
-			memcg = NULL;
-	}
+	/*
+	 * The lowest bit set means that memcg isn't a valid
+	 * memcg pointer, but a obj_cgroups pointer.
+	 * In this case the page is shared and doesn't belong
+	 * to any specific memory cgroup.
+	 */
+	if ((unsigned long) memcg & 0x1UL)
+		memcg = NULL;
 
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
@@ -2822,12 +2818,18 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	page = virt_to_head_page(p);
 
 	/*
-	 * Slab pages don't have page->mem_cgroup set because corresponding
-	 * kmem caches can be reparented during the lifetime. That's why
-	 * memcg_from_slab_page() should be used instead.
+	 * Slab objects are accounted individually, not per-page.
+	 * Memcg membership data for each individual object is saved in
+	 * the page->obj_cgroups.
 	 */
-	if (PageSlab(page))
-		return memcg_from_slab_page(page);
+	if (page_has_obj_cgroups(page)) {
+		struct obj_cgroup *objcg;
+		unsigned int off;
+
+		off = obj_to_index(page->slab_cache, page, p);
+		objcg = page_obj_cgroups(page)[off];
+		return obj_cgroup_memcg(objcg);
+	}
 
 	/* All other pages use page->mem_cgroup */
 	return page->mem_cgroup;
@@ -2882,9 +2884,7 @@ static int memcg_alloc_cache_id(void)
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
-	err = memcg_update_all_caches(size);
-	if (!err)
-		err = memcg_update_all_list_lrus(size);
+	err = memcg_update_all_list_lrus(size);
 	if (!err)
 		memcg_nr_cache_ids = size;
 
@@ -2903,7 +2903,6 @@ static void memcg_free_cache_id(int id)
 }
 
 struct memcg_kmem_cache_create_work {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *cachep;
 	struct work_struct work;
 };
@@ -2912,31 +2911,24 @@ static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
 		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct mem_cgroup *memcg = cw->memcg;
 	struct kmem_cache *cachep = cw->cachep;
 
-	memcg_create_kmem_cache(memcg, cachep);
+	memcg_create_kmem_cache(cachep);
 
-	css_put(&memcg->css);
 	kfree(cw);
 }
 
 /*
  * Enqueue the creation of a per-memcg kmem_cache.
  */
-static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
-					       struct kmem_cache *cachep)
+static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
 {
 	struct memcg_kmem_cache_create_work *cw;
 
-	if (!css_tryget_online(&memcg->css))
-		return;
-
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
@@ -2944,102 +2936,26 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 }
 
 /**
- * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
  *
  * Return the kmem_cache we're supposed to use for a slab allocation.
- * We try to use the current memcg's version of the cache.
  *
  * If the cache does not exist yet, if we are the first user of it, we
  * create it asynchronously in a workqueue and let the current allocation
  * go through with the original cache.
- *
- * This function takes a reference to the cache it returns to assure it
- * won't get destroyed while we are working with it. Once the caller is
- * done with it, memcg_kmem_put_cache() must be called to release the
- * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
-	struct memcg_cache_array *arr;
-	int kmemcg_id;
 
-	VM_BUG_ON(!is_root_cache(cachep));
-
-	if (memcg_kmem_bypass())
+	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
+	if (unlikely(!memcg_cachep)) {
+		memcg_schedule_kmem_cache_create(cachep);
 		return cachep;
-
-	rcu_read_lock();
-
-	if (unlikely(current->active_memcg))
-		memcg = current->active_memcg;
-	else
-		memcg = mem_cgroup_from_task(current);
-
-	if (!memcg || memcg == root_mem_cgroup)
-		goto out_unlock;
-
-	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
-	if (kmemcg_id < 0)
-		goto out_unlock;
-
-	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match the data dependency
-	 * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
-	 */
-	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
-
-	/*
-	 * If we are in a safe context (can wait, and not in interrupt
-	 * context), we could be be predictable and return right away.
-	 * This would guarantee that the allocation being performed
-	 * already belongs in the new cache.
-	 *
-	 * However, there are some clashes that can arrive from locking.
-	 * For instance, because we acquire the slab_mutex while doing
-	 * memcg_create_kmem_cache, this means no further allocation
-	 * could happen with the slab_mutex held. So it's better to
-	 * defer everything.
-	 *
-	 * If the memcg is dying or memcg_cache is about to be released,
-	 * don't bother creating new kmem_caches. Because memcg_cachep
-	 * is ZEROed as the fist step of kmem offlining, we don't need
-	 * percpu_ref_tryget_live() here. css_tryget_online() check in
-	 * memcg_schedule_kmem_cache_create() will prevent us from
-	 * creation of a new kmem_cache.
-	 */
-	if (unlikely(!memcg_cachep))
-		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
-		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
-
-		if (!objcg || !obj_cgroup_tryget(objcg)) {
-			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
-			goto out_unlock;
-		}
-
-		*objcgp = objcg;
-		cachep = memcg_cachep;
 	}
-out_unlock:
-	rcu_read_unlock();
-	return cachep;
-}
 
-/**
- * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
- * @cachep: the cache returned by memcg_kmem_get_cache
- */
-void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (!is_root_cache(cachep))
-		percpu_ref_put(&cachep->memcg_params.refcnt);
+	return memcg_cachep;
 }
 
 /**
@@ -3728,7 +3644,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
-	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -3741,22 +3656,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	if (memcg->kmem_state != KMEM_ONLINE)
 		return;
-	/*
-	 * Clear the online state before clearing memcg_caches array
-	 * entries. The slab_mutex in memcg_deactivate_kmem_caches()
-	 * guarantees that no cache will be created for this cgroup
-	 * after we are done (see memcg_create_kmem_cache()).
-	 */
+
 	memcg->kmem_state = KMEM_ALLOCATED;
 
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
-	/*
-	 * Deactivate and reparent kmem_caches and objcgs.
-	 */
-	memcg_deactivate_kmem_caches(memcg, parent);
 	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -3791,10 +3697,8 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
 		memcg_offline_kmem(memcg);
 
-	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		WARN_ON(!list_empty(&memcg->kmem_caches));
+	if (memcg->kmem_state == KMEM_ALLOCATED)
 		static_branch_dec(&memcg_kmem_enabled_key);
-	}
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -5386,9 +5290,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
-#ifdef CONFIG_MEMCG_KMEM
-		INIT_LIST_HEAD(&memcg->kmem_caches);
-#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 02b4363930c1..7e8d0f62f30b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache, NULL);
+	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
@@ -2243,17 +2243,6 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
-{
-	__kmem_cache_shrink(cachep);
-}
-
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-}
-#endif
-
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	return __kmem_cache_shrink(cachep);
@@ -3861,7 +3850,8 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 		return ret;
 
 	lockdep_assert_held(&slab_mutex);
-	for_each_memcg_cache(c, cachep) {
+	c = memcg_cache(cachep);
+	if (c) {
 		/* return value determined by the root cache only */
 		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index f219a29052d9..8f8552df5675 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,66 +32,25 @@ struct kmem_cache {
 
 #else /* !CONFIG_SLOB */
 
-struct memcg_cache_array {
-	struct rcu_head rcu;
-	struct kmem_cache *entries[];
-};
-
 /*
  * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child caches will have it. For the root cache,
- * this will hold a dynamically allocated array large enough to hold
- * information about the currently limited memcgs in the system. To allow the
- * array to be accessed without taking any locks, on relocation we free the old
- * version only after a grace period.
- *
- * Root and child caches hold different metadata.
+ * Both the root cache and the child cache will have it. Some fields are used
+ * in both cases, others are specific to root caches.
  *
  * @root_cache:	Common to root and child caches.  NULL for root, pointer to
  *		the root cache for children.
  *
  * The following fields are specific to root caches.
  *
- * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
- *		used to index child cachces during allocation and cleared
- *		early during shutdown.
- *
- * @root_caches_node: List node for slab_root_caches list.
- *
- * @children:	List of all child caches.  While the child caches are also
- *		reachable through @memcg_caches, a child cache remains on
- *		this list until it is actually destroyed.
- *
- * The following fields are specific to child caches.
- *
- * @memcg:	Pointer to the memcg this cache belongs to.
- *
- * @children_node: List node for @root_cache->children list.
- *
- * @kmem_caches_node: List node for @memcg->kmem_caches list.
+ * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
+ *		cgroups.
+ * @root_caches_node: list node for slab_root_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
-	union {
-		struct {
-			struct memcg_cache_array __rcu *memcg_caches;
-			struct list_head __root_caches_node;
-			struct list_head children;
-			bool dying;
-		};
-		struct {
-			struct mem_cgroup *memcg;
-			struct list_head children_node;
-			struct list_head kmem_caches_node;
-			struct percpu_ref refcnt;
-
-			void (*work_fn)(struct kmem_cache *);
-			union {
-				struct rcu_head rcu_head;
-				struct work_struct work;
-			};
-		};
-	};
+
+	struct kmem_cache *memcg_cache;
+	struct list_head __root_caches_node;
 };
 #endif /* CONFIG_SLOB */
 
@@ -234,8 +193,6 @@ bool __kmem_cache_empty(struct kmem_cache *);
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
-void __kmemcg_cache_deactivate(struct kmem_cache *s);
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 void kmem_cache_shrink_all(struct kmem_cache *s);
 
@@ -281,14 +238,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 extern struct list_head		slab_root_caches;
 #define root_caches_node	memcg_params.__root_caches_node
 
-/*
- * Iterate over all memcg caches of the given root cache. The caller must hold
- * slab_mutex.
- */
-#define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -319,6 +268,13 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return s->memcg_params.memcg_cache;
+	return NULL;
+}
+
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -331,25 +287,9 @@ static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 		((unsigned long)page->obj_cgroups & ~0x1UL);
 }
 
-/*
- * Expects a pointer to a slab page. Please note, that PageSlab() check
- * isn't sufficient, as it returns true also for tail compound slab pages,
- * which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageSlab() && !PageTail()
- * check.
- *
- * The kmem_cache can be reparented asynchronously. The caller must ensure
- * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
- */
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline bool page_has_obj_cgroups(struct page *page)
 {
-	struct kmem_cache *s;
-
-	s = READ_ONCE(page->slab_cache);
-	if (s && !is_root_cache(s))
-		return READ_ONCE(s->memcg_params.memcg);
-
-	return NULL;
+	return ((unsigned long)page->obj_cgroups & 0x1UL);
 }
 
 static inline int memcg_alloc_page_obj_cgroups(struct page *page,
@@ -386,16 +326,25 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 						size_t objects, gfp_t flags)
 {
 	struct kmem_cache *cachep;
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return s;
 
-	cachep = memcg_kmem_get_cache(s, objcgp);
+	cachep = memcg_kmem_get_cache(s);
 	if (is_root_cache(cachep))
 		return s;
 
-	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
-		memcg_kmem_put_cache(cachep);
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return s;
+
+	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
+		obj_cgroup_put(objcg);
 		cachep = NULL;
 	}
 
+	*objcgp = objcg;
 	return cachep;
 }
 
@@ -434,7 +383,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 		}
 	}
 	obj_cgroup_put(objcg);
-	memcg_kmem_put_cache(s);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
@@ -458,7 +406,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
 
@@ -466,9 +414,6 @@ extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 #define slab_root_caches	slab_caches
 #define root_caches_node	list
 
-#define for_each_memcg_cache(iter, root) \
-	for ((void)(iter), (void)(root); 0; )
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -490,7 +435,17 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	return NULL;
+}
+
+static inline bool page_has_obj_cgroups(struct page *page)
+{
+	return false;
+}
+
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
 }
@@ -527,8 +482,7 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s,
-				    struct mem_cgroup *memcg)
+static inline void memcg_link_cache(struct kmem_cache *s)
 {
 }
 
@@ -549,17 +503,14 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
 	if (memcg_kmem_enabled() && !is_root_cache(s)) {
 		int ret;
 
 		ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
 		if (ret)
 			return ret;
-
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
 	return 0;
@@ -568,12 +519,9 @@ static __always_inline int charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
-	if (memcg_kmem_enabled() && !is_root_cache(s)) {
+	if (memcg_kmem_enabled() && !is_root_cache(s))
 		memcg_free_page_obj_cgroups(page);
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -722,9 +670,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
-void *memcg_slab_start(struct seq_file *m, loff_t *pos);
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
-void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3c89c2adc930..e9deaafddbb6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -131,141 +131,36 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
-static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
-	INIT_LIST_HEAD(&s->memcg_params.children);
-	s->memcg_params.dying = false;
+	s->memcg_params.memcg_cache = NULL;
 }
 
-static int init_memcg_params(struct kmem_cache *s,
-			     struct kmem_cache *root_cache)
+static void init_memcg_params(struct kmem_cache *s,
+			      struct kmem_cache *root_cache)
 {
-	struct memcg_cache_array *arr;
-
-	if (root_cache) {
-		int ret = percpu_ref_init(&s->memcg_params.refcnt,
-					  kmemcg_cache_shutdown,
-					  0, GFP_KERNEL);
-		if (ret)
-			return ret;
-
+	if (root_cache)
 		s->memcg_params.root_cache = root_cache;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
-		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
-		return 0;
-	}
-
-	slab_init_memcg_params(s);
-
-	if (!memcg_nr_cache_ids)
-		return 0;
-
-	arr = kvzalloc(sizeof(struct memcg_cache_array) +
-		       memcg_nr_cache_ids * sizeof(void *),
-		       GFP_KERNEL);
-	if (!arr)
-		return -ENOMEM;
-
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
-	return 0;
-}
-
-static void destroy_memcg_params(struct kmem_cache *s)
-{
-	if (is_root_cache(s)) {
-		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-	} else {
-		mem_cgroup_put(s->memcg_params.memcg);
-		WRITE_ONCE(s->memcg_params.memcg, NULL);
-		percpu_ref_exit(&s->memcg_params.refcnt);
-	}
-}
-
-static void free_memcg_params(struct rcu_head *rcu)
-{
-	struct memcg_cache_array *old;
-
-	old = container_of(rcu, struct memcg_cache_array, rcu);
-	kvfree(old);
-}
-
-static int update_memcg_params(struct kmem_cache *s, int new_array_size)
-{
-	struct memcg_cache_array *old, *new;
-
-	new = kvzalloc(sizeof(struct memcg_cache_array) +
-		       new_array_size * sizeof(void *), GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	old = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	if (old)
-		memcpy(new->entries, old->entries,
-		       memcg_nr_cache_ids * sizeof(void *));
-
-	rcu_assign_pointer(s->memcg_params.memcg_caches, new);
-	if (old)
-		call_rcu(&old->rcu, free_memcg_params);
-	return 0;
-}
-
-int memcg_update_all_caches(int num_memcgs)
-{
-	struct kmem_cache *s;
-	int ret = 0;
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		ret = update_memcg_params(s, num_memcgs);
-		/*
-		 * Instead of freeing the memory, we'll just leave the caches
-		 * up to this point in an updated state.
-		 */
-		if (ret)
-			break;
-	}
-	mutex_unlock(&slab_mutex);
-	return ret;
+	else
+		slab_init_memcg_params(s);
 }
 
-void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
+void memcg_link_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_add(&s->root_caches_node, &slab_root_caches);
-	} else {
-		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
-		list_add(&s->memcg_params.children_node,
-			 &s->memcg_params.root_cache->memcg_params.children);
-		list_add(&s->memcg_params.kmem_caches_node,
-			 &s->memcg_params.memcg->kmem_caches);
-	}
 }
 
 static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_del(&s->root_caches_node);
-	} else {
-		list_del(&s->memcg_params.children_node);
-		list_del(&s->memcg_params.kmem_caches_node);
-	}
 }
 #else
-static inline int init_memcg_params(struct kmem_cache *s,
-				    struct kmem_cache *root_cache)
-{
-	return 0;
-}
-
-static inline void destroy_memcg_params(struct kmem_cache *s)
+static inline void init_memcg_params(struct kmem_cache *s,
+				     struct kmem_cache *root_cache)
 {
 }
 
@@ -380,7 +275,7 @@ static struct kmem_cache *create_cache(const char *name,
 		unsigned int object_size, unsigned int align,
 		slab_flags_t flags, unsigned int useroffset,
 		unsigned int usersize, void (*ctor)(void *),
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
@@ -400,24 +295,20 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, root_cache);
-	if (err)
-		goto out_free_cache;
-
+	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, memcg);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
 	return s;
 
 out_free_cache:
-	destroy_memcg_params(s);
 	kmem_cache_free(kmem_cache, s);
 	goto out;
 }
@@ -504,7 +395,7 @@ kmem_cache_create_usercopy(const char *name,
 
 	s = create_cache(cache_name, size,
 			 calculate_alignment(flags, align, size),
-			 flags, useroffset, usersize, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -629,51 +520,27 @@ static int shutdown_cache(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG_KMEM
 /*
- * memcg_create_kmem_cache - Create a cache for a memory cgroup.
- * @memcg: The memory cgroup the new cache is for.
+ * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
  * @root_cache: The parent of the new cache.
  *
  * This function attempts to create a kmem cache that will serve allocation
- * requests going from @memcg to @root_cache. The new cache inherits properties
- * from its parent.
+ * requests going from all non-root memory cgroups to @root_cache. The new
+ * cache inherits properties from its parent.
  */
-void memcg_create_kmem_cache(struct mem_cgroup *memcg,
-			     struct kmem_cache *root_cache)
+void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 {
-	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = &memcg->css;
-	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-	int idx;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	/*
-	 * The memory cgroup could have been offlined while the cache
-	 * creation work was pending.
-	 */
-	if (memcg->kmem_state != KMEM_ONLINE)
+	if (root_cache->memcg_params.memcg_cache)
 		goto out_unlock;
 
-	idx = memcg_cache_id(memcg);
-	arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-
-	/*
-	 * Since per-memcg caches are created asynchronously on first
-	 * allocation (see memcg_kmem_get_cache()), several threads can try to
-	 * create the same cache, but only one of them may succeed.
-	 */
-	if (arr->entries[idx])
-		goto out_unlock;
-
-	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
-			       css->serial_nr, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
 	if (!cache_name)
 		goto out_unlock;
 
@@ -681,7 +548,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			 root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
 			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, memcg, root_cache);
+			 root_cache->ctor, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
@@ -698,7 +565,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * initialized.
 	 */
 	smp_wmb();
-	arr->entries[idx] = s;
+	root_cache->memcg_params.memcg_cache = s;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
@@ -707,197 +574,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.work);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	s->memcg_params.work_fn(s);
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static void kmemcg_rcufn(struct rcu_head *head)
-{
-	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.rcu_head);
-
-	/*
-	 * We need to grab blocking locks.  Bounce to ->work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized earlier.
-	 */
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-}
-
-static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
-{
-	WARN_ON(shutdown_cache(s));
-}
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
-{
-	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
-					    memcg_params.refcnt);
-	unsigned long flags;
-
-	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-
-unlock:
-	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
-}
-
-static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	__kmemcg_cache_deactivate_after_rcu(s);
-	percpu_ref_kill(&s->memcg_params.refcnt);
-}
-
-static void kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	if (WARN_ON_ONCE(is_root_cache(s)))
-		return;
-
-	__kmemcg_cache_deactivate(s);
-	s->flags |= SLAB_DEACTIVATED;
-
-	/*
-	 * memcg_kmem_wq_lock is used to synchronize memcg_params.dying
-	 * flag and make sure that no new kmem_cache deactivation tasks
-	 * are queued (see flush_memcg_workqueue() ).
-	 */
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
-	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
-unlock:
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-}
-
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
-{
-	int idx;
-	struct memcg_cache_array *arr;
-	struct kmem_cache *s, *c;
-	unsigned int nr_reparented;
-
-	idx = memcg_cache_id(memcg);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-						lockdep_is_held(&slab_mutex));
-		c = arr->entries[idx];
-		if (!c)
-			continue;
-
-		kmemcg_cache_deactivate(c);
-		arr->entries[idx] = NULL;
-	}
-	nr_reparented = 0;
-	list_for_each_entry(s, &memcg->kmem_caches,
-			    memcg_params.kmem_caches_node) {
-		WRITE_ONCE(s->memcg_params.memcg, parent);
-		css_put(&memcg->css);
-		nr_reparented++;
-	}
-	if (nr_reparented) {
-		list_splice_init(&memcg->kmem_caches,
-				 &parent->kmem_caches);
-		css_get_many(&parent->css, nr_reparented);
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
-	struct memcg_cache_array *arr;
-	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
-	int i;
-
 	BUG_ON(!is_root_cache(s));
 
-	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
-	 */
-	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.children_node, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
-
-	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
-	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
-		shutdown_cache(c);
-
-	list_splice(&busy, &s->memcg_params.children);
+	if (s->memcg_params.memcg_cache)
+		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
 
-	/*
-	 * A cache being destroyed must be empty. In particular, this means
-	 * that all per memcg caches attached to it must be empty too.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		return -EBUSY;
 	return 0;
 }
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	s->memcg_params.dying = true;
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-
-	/*
-	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-	 * sure all registered rcu callbacks have been invoked.
-	 */
-	rcu_barrier();
-
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
@@ -905,18 +593,6 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
 	 */
 	if (likely(memcg_kmem_cache_wq))
 		flush_workqueue(memcg_kmem_cache_wq);
-
-	/*
-	 * If we're racing with children kmem_cache deactivation, it might
-	 * take another rcu grace period to complete their destruction.
-	 * At this moment the corresponding percpu_ref_kill() call should be
-	 * done, but it might take another rcu grace period to complete
-	 * switching to the atomic mode.
-	 * Please, note that we check without grabbing the slab_mutex. It's safe
-	 * because at this moment the children list can't grow.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		rcu_barrier();
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -932,7 +608,6 @@ static inline void flush_memcg_workqueue(struct kmem_cache *s)
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
-	destroy_memcg_params(s);
 	kfree_const(s->name);
 	kmem_cache_free(kmem_cache, s);
 }
@@ -996,7 +671,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 EXPORT_SYMBOL(kmem_cache_shrink);
 
 /**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * kmem_cache_shrink_all - shrink root and memcg caches
  * @s: The cache pointer
  */
 void kmem_cache_shrink_all(struct kmem_cache *s)
@@ -1013,21 +688,11 @@ void kmem_cache_shrink_all(struct kmem_cache *s)
 	kasan_cache_shrink(s);
 	__kmem_cache_shrink(s);
 
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
+	c = memcg_cache(s);
+	if (c) {
 		kasan_cache_shrink(c);
 		__kmem_cache_shrink(c);
 	}
-	mutex_unlock(&slab_mutex);
 	put_online_mems();
 	put_online_cpus();
 }
@@ -1082,7 +747,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1445,7 +1110,8 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	if (!is_root_cache(s))
 		return;
 
-	for_each_memcg_cache(c, s) {
+	c = memcg_cache(s);
+	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
 		get_slabinfo(c, &sinfo);
 
@@ -1576,7 +1242,7 @@ module_init(slab_proc_init);
 
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
 /*
- * Display information about kmem caches that have child memcg caches.
+ * Display information about kmem caches that have a memcg cache.
  */
 static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 {
@@ -1588,9 +1254,9 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
 	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		/*
-		 * Skip kmem caches that don't have any memcg children.
+		 * Skip kmem caches that don't have the memcg cache.
 		 */
-		if (list_empty(&s->memcg_params.children))
+		if (!s->memcg_params.memcg_cache)
 			continue;
 
 		memset(&sinfo, 0, sizeof(sinfo));
@@ -1599,23 +1265,13 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
 			   sinfo.active_slabs, sinfo.num_slabs);
 
-		for_each_memcg_cache(c, s) {
-			struct cgroup_subsys_state *css;
-			char *status = "";
-
-			css = &c->memcg_params.memcg->css;
-			if (!(css->flags & CSS_ONLINE))
-				status = ":dead";
-			else if (c->flags & SLAB_DEACTIVATED)
-				status = ":deact";
-
-			memset(&sinfo, 0, sizeof(sinfo));
-			get_slabinfo(c, &sinfo);
-			seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n",
-				   cache_name(c), css->id, status,
-				   sinfo.active_objs, sinfo.num_objs,
-				   sinfo.active_slabs, sinfo.num_slabs);
-		}
+		c = s->memcg_params.memcg_cache;
+		memset(&sinfo, 0, sizeof(sinfo));
+		get_slabinfo(c, &sinfo);
+		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
+			   cache_name(c), root_mem_cgroup->css.id,
+			   sinfo.active_objs, sinfo.num_objs,
+			   sinfo.active_slabs, sinfo.num_slabs);
 	}
 	mutex_unlock(&slab_mutex);
 	return 0;
diff --git a/mm/slub.c b/mm/slub.c
index 7007eceac4c4..6761e40e2c2e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4136,36 +4136,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	/*
-	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
-	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
-}
-
-void __kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	/*
-	 * Disable empty slabs caching. Used to avoid pinning offline
-	 * memory cgroups by kmem pages that can be freed.
-	 */
-	slub_set_cpu_partial(s, 0);
-	s->min_partial = 0;
-}
-#endif	/* CONFIG_MEMCG */
-
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4322,7 +4292,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	return s;
 }
 
@@ -4390,7 +4360,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache(c, s) {
+		c = memcg_cache(s);
+		if (c) {
 			c->object_size = s->object_size;
 			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
 		}
@@ -5645,7 +5616,8 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache(c, s)
+		c = memcg_cache(s);
+		if (c)
 			attribute->store(c, buf, len);
 		mutex_unlock(&slab_mutex);
 	}
-- 
2.25.4


* [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (11 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 17:29   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache() Roman Gushchin
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Because the number of non-root kmem_caches no longer depends on the
number of memory cgroups and is generally not very big, there is no
longer any need for a dedicated workqueue.

Also, as memcg_create_kmem_cache() no longer needs any argument
except the root kmem_cache, it's possible to embed the work structure
into the kmem_cache and avoid allocating it dynamically.

This also simplifies the synchronization: each root kmem_cache has
only one work item, so there are no more concurrent attempts to create
a non-root kmem_cache for a root kmem_cache. The second and all
following attempts to queue the work fail while it is still pending.

On the kmem_cache destruction path there is no more need to call the
expensive flush_workqueue() and wait for all pending work items to
finish. Instead, cancel_work_sync() can be used to cancel or wait for
just that one work item.
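
For illustration only, here is a minimal userspace sketch of the
scheme described above: a work item embedded in its owner, recovered
with container_of(), and queued at most once while pending. All names
ending in _sketch are hypothetical, and the single-threaded "pending"
flag is an assumption standing in for the kernel workqueue machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical work item: a pending flag plus the function to run. */
struct work_sketch {
	bool pending;
	void (*func)(struct work_sketch *work);
};

/* Hypothetical root cache; the work item is embedded, not allocated. */
struct cache_sketch {
	bool has_memcg_cache;	/* stands in for memcg_params.memcg_cache */
	int nr_create_calls;	/* how many times creation actually ran */
	struct work_sketch work;
};

/* Work function: recovers the owning cache from the embedded item. */
static void create_func_sketch(struct work_sketch *work)
{
	struct cache_sketch *root =
		container_of(work, struct cache_sketch, work);

	root->nr_create_calls++;
	root->has_memcg_cache = true;
}

static void init_cache_sketch(struct cache_sketch *root)
{
	root->has_memcg_cache = false;
	root->nr_create_calls = 0;
	root->work.pending = false;
	root->work.func = create_func_sketch;
}

/* Like queue_work(): returns false if the item is already pending. */
static bool queue_work_sketch(struct work_sketch *work)
{
	assert(work && work->func);
	if (work->pending)
		return false;
	work->pending = true;
	return true;
}

/* What a worker would do: clear the pending flag, then run the function. */
static void run_pending_sketch(struct work_sketch *work)
{
	if (!work->pending)
		return;
	work->pending = false;
	work->func(work);
}

/* Single-threaded stand-in for cancel_work_sync(): true if it was pending. */
static bool cancel_work_sketch(struct work_sketch *work)
{
	bool was_pending = work->pending;

	work->pending = false;
	return was_pending;
}
```

Queueing the same item twice before it runs results in one creation
call, and no allocation is needed on the queueing path, which is why
the GFP_NOWAIT kmalloc() in the old scheduling helper can go away.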

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h |  1 -
 mm/memcontrol.c            | 48 +-------------------------------------
 mm/slab.h                  |  2 ++
 mm/slab_common.c           | 22 +++++++++--------
 4 files changed, 15 insertions(+), 58 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e2c4d54aa1f6..ed0d2ac6a5d2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1418,7 +1418,6 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
 extern struct static_key_false memcg_kmem_enabled_key;
-extern struct workqueue_struct *memcg_kmem_cache_wq;
 
 extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 995204f65217..2695cdc15baa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -399,8 +399,6 @@ void memcg_put_cache_ids(void)
  */
 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
-
-struct workqueue_struct *memcg_kmem_cache_wq;
 #endif
 
 static int memcg_shrinker_map_size;
@@ -2902,39 +2900,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-struct memcg_kmem_cache_create_work {
-	struct kmem_cache *cachep;
-	struct work_struct work;
-};
-
-static void memcg_kmem_cache_create_func(struct work_struct *w)
-{
-	struct memcg_kmem_cache_create_work *cw =
-		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct kmem_cache *cachep = cw->cachep;
-
-	memcg_create_kmem_cache(cachep);
-
-	kfree(cw);
-}
-
-/*
- * Enqueue the creation of a per-memcg kmem_cache.
- */
-static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
-{
-	struct memcg_kmem_cache_create_work *cw;
-
-	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
-		return;
-
-	cw->cachep = cachep;
-	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
-
-	queue_work(memcg_kmem_cache_wq, &cw->work);
-}
-
 /**
  * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
@@ -2951,7 +2916,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 
 	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
 	if (unlikely(!memcg_cachep)) {
-		memcg_schedule_kmem_cache_create(cachep);
+		queue_work(system_wq, &cachep->memcg_params.work);
 		return cachep;
 	}
 
@@ -7062,17 +7027,6 @@ static int __init mem_cgroup_init(void)
 {
 	int cpu, node;
 
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * Kmem cache creation is mostly done with the slab_mutex held,
-	 * so use a workqueue with limited concurrency to avoid stalling
-	 * all worker threads in case lots of cgroups are created and
-	 * destroyed simultaneously.
-	 */
-	memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
-	BUG_ON(!memcg_kmem_cache_wq);
-#endif
-
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
diff --git a/mm/slab.h b/mm/slab.h
index 8f8552df5675..c6c7987dfd85 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -45,12 +45,14 @@ struct kmem_cache {
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
  * @root_caches_node: list node for slab_root_caches list.
+ * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
 	struct list_head __root_caches_node;
+	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e9deaafddbb6..10aa2acb84ca 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,10 +132,18 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 
 LIST_HEAD(slab_root_caches);
 
+static void memcg_kmem_cache_create_func(struct work_struct *work)
+{
+	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
+						 memcg_params.work);
+	memcg_create_kmem_cache(cachep);
+}
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
 	s->memcg_params.memcg_cache = NULL;
+	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
 }
 
 static void init_memcg_params(struct kmem_cache *s,
@@ -584,15 +592,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static void flush_memcg_workqueue(struct kmem_cache *s)
+static void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
-	/*
-	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
-	 * deactivates the memcg kmem_caches through workqueue. Make sure all
-	 * previous workitems on workqueue are processed.
-	 */
-	if (likely(memcg_kmem_cache_wq))
-		flush_workqueue(memcg_kmem_cache_wq);
+	cancel_work_sync(&s->memcg_params.work);
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -600,7 +602,7 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static inline void flush_memcg_workqueue(struct kmem_cache *s)
+static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
@@ -619,7 +621,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;
 
-	flush_memcg_workqueue(s);
+	cancel_memcg_cache_creation(s);
 
 	get_online_cpus();
 	get_online_mems();
-- 
2.25.4


* [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache()
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (12 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 18:42   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

The memcg_kmem_get_cache() function has become trivial, so let's
inline it into its single call site: memcg_slab_pre_alloc_hook().

This makes the code less bulky and can also help the compiler
generate better code.
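
For illustration only, a minimal userspace sketch of the lockless
lookup that gets inlined: read the memcg cache pointer once, and if it
is not there yet, fall back to the root cache. The _sketch names are
hypothetical, the counter stands in for the queue_work() call, and C11
acquire/release atomics are an assumption standing in for the kernel's
READ_ONCE() and smp_wmb() pairing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a kmem_cache and its memcg_params. */
struct cache_sketch {
	_Atomic(struct cache_sketch *) memcg_cache;	/* NULL until created */
	int nr_queue_attempts;	/* counts would-be queue_work() calls */
};

static void init_cache_sketch(struct cache_sketch *s)
{
	atomic_init(&s->memcg_cache, NULL);
	s->nr_queue_attempts = 0;
}

/*
 * Reader side, modelled on the inlined lookup: no lock is taken. If the
 * memcg cache is missing, note that creation would be queued and fall
 * back to the root cache for this allocation.
 */
static struct cache_sketch *pick_cache_sketch(struct cache_sketch *s)
{
	struct cache_sketch *cachep;

	assert(s);
	cachep = atomic_load_explicit(&s->memcg_cache, memory_order_acquire);
	if (!cachep) {
		s->nr_queue_attempts++;
		return s;
	}
	return cachep;
}

/* Writer side: publish the fully initialized cache with release ordering. */
static void publish_cache_sketch(struct cache_sketch *root,
				 struct cache_sketch *memcg)
{
	atomic_store_explicit(&root->memcg_cache, memcg, memory_order_release);
}
```

Before publication the lookup returns the root cache itself; after
publication it returns the memcg cache and no further queue attempts
are made. The real hook additionally handles the obj_cgroup lookup,
which this sketch leaves out.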

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h |  2 --
 mm/memcontrol.c            | 25 +------------------------
 mm/slab.h                  | 11 +++++++++--
 mm/slab_common.c           |  2 +-
 4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ed0d2ac6a5d2..eede46c43573 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1403,8 +1403,6 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
-
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
 			unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2695cdc15baa..09a84326ead1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -393,7 +393,7 @@ void memcg_put_cache_ids(void)
 
 /*
  * A lot of the calls to the cache allocation functions are expected to be
- * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * inlined by the compiler. Since the calls to memcg_slab_pre_alloc_hook() are
  * conditional to this static branch, we'll have to allow modules that does
  * kmem_cache_alloc and the such to see this symbol as well
  */
@@ -2900,29 +2900,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-/**
- * memcg_kmem_get_cache: select memcg or root cache for allocation
- * @cachep: the original global kmem cache
- *
- * Return the kmem_cache we're supposed to use for a slab allocation.
- *
- * If the cache does not exist yet, if we are the first user of it, we
- * create it asynchronously in a workqueue and let the current allocation
- * go through with the original cache.
- */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
-{
-	struct kmem_cache *memcg_cachep;
-
-	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
-	if (unlikely(!memcg_cachep)) {
-		queue_work(system_wq, &cachep->memcg_params.work);
-		return cachep;
-	}
-
-	return memcg_cachep;
-}
-
 /**
  * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
diff --git a/mm/slab.h b/mm/slab.h
index c6c7987dfd85..f4033298a776 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -333,9 +333,16 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	if (memcg_kmem_bypass())
 		return s;
 
-	cachep = memcg_kmem_get_cache(s);
-	if (is_root_cache(cachep))
+	cachep = READ_ONCE(s->memcg_params.memcg_cache);
+	if (unlikely(!cachep)) {
+		/*
+		 * If memcg cache does not exist yet, we schedule its
+		 * asynchronous creation and let the current allocation
+		 * go through with the root cache.
+		 */
+		queue_work(system_wq, &s->memcg_params.work);
 		return s;
+	}
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 10aa2acb84ca..f8874a159637 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -568,7 +568,7 @@ void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 	}
 
 	/*
-	 * Since readers won't lock (see memcg_kmem_get_cache()), we need a
+	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
 	 * initialized.
 	 */
-- 
2.25.4


* [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (13 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache() Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 17:36   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Currently there are two lists of kmem_caches:
1) slab_caches, which contains all kmem_caches,
2) slab_root_caches, which contains only root kmem_caches.

And there is some preprocessor magic to have a single list
if CONFIG_MEMCG_KMEM isn't enabled.

It was required earlier because the number of non-root kmem_caches
was proportional to the number of memory cgroups and could become
very large. Now that it cannot exceed the number of root kmem_caches,
there is no reason to maintain two lists.

We never iterate over the slab_root_caches list on any hot paths,
so it's perfectly fine to iterate over slab_caches and filter out
non-root kmem_caches.

This allows removing a lot of config-dependent code and two pointers
from the kmem_cache structure.
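
For illustration only, a minimal userspace sketch of the "one list
plus a filter" approach described above. The _sketch names and the
hand-rolled singly linked list are assumptions; the kernel code uses
struct list_head and list_for_each_entry() on slab_caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kmem_cache on the single global list. */
struct cache_sketch {
	const char *name;
	struct cache_sketch *root_cache;	/* NULL for a root cache */
	struct cache_sketch *next;		/* link in the only list */
};

/* Same test as is_root_cache(): a root cache has no root_cache pointer. */
static bool is_root_cache_sketch(const struct cache_sketch *s)
{
	assert(s);
	return !s->root_cache;
}

/* Walks the only list and counts root caches, skipping non-root ones. */
static size_t count_root_caches_sketch(const struct cache_sketch *head)
{
	size_t n = 0;

	for (const struct cache_sketch *s = head; s; s = s->next)
		if (is_root_cache_sketch(s))
			n++;
	return n;
}
```

Since each root cache has at most one non-root cache, the filtered
walk visits at most twice as many entries as a dedicated root list
would, which is acceptable off the hot paths.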

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.c        |  1 -
 mm/slab.h        | 17 -----------------
 mm/slab_common.c | 37 ++++++++-----------------------------
 mm/slub.c        |  1 -
 4 files changed, 8 insertions(+), 48 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 7e8d0f62f30b..18a782bacd1b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,6 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index f4033298a776..c49a863adb63 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -44,14 +44,12 @@ struct kmem_cache {
  *
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
- * @root_caches_node: list node for slab_root_caches list.
  * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
-	struct list_head __root_caches_node;
 	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
@@ -235,11 +233,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-/* List of all root caches. */
-extern struct list_head		slab_root_caches;
-#define root_caches_node	memcg_params.__root_caches_node
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -415,14 +408,8 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
-
-/* If !memcg, all caches are root. */
-#define slab_root_caches	slab_caches
-#define root_caches_node	list
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -491,10 +478,6 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f8874a159637..c045afb9724e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -129,9 +129,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-LIST_HEAD(slab_root_caches);
-
 static void memcg_kmem_cache_create_func(struct work_struct *work)
 {
 	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
@@ -154,27 +151,11 @@ static void init_memcg_params(struct kmem_cache *s,
 	else
 		slab_init_memcg_params(s);
 }
-
-void memcg_link_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_add(&s->root_caches_node, &slab_root_caches);
-}
-
-static void memcg_unlink_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_del(&s->root_caches_node);
-}
 #else
 static inline void init_memcg_params(struct kmem_cache *s,
 				     struct kmem_cache *root_cache)
 {
 }
-
-static inline void memcg_unlink_cache(struct kmem_cache *s)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
@@ -251,7 +232,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
 	if (flags & SLAB_NEVER_MERGE)
 		return NULL;
 
-	list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry_reverse(s, &slab_caches, list) {
 		if (slab_unmergeable(s))
 			continue;
 
@@ -310,7 +291,6 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -505,7 +485,6 @@ static int shutdown_cache(struct kmem_cache *s)
 	if (__kmem_cache_shutdown(s) != 0)
 		return -EBUSY;
 
-	memcg_unlink_cache(s);
 	list_del(&s->list);
 
 	if (s->flags & SLAB_TYPESAFE_BY_RCU) {
@@ -749,7 +728,6 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1090,12 +1068,12 @@ static void print_slabinfo_header(struct seq_file *m)
 void *slab_start(struct seq_file *m, loff_t *pos)
 {
 	mutex_lock(&slab_mutex);
-	return seq_list_start(&slab_root_caches, *pos);
+	return seq_list_start(&slab_caches, *pos);
 }
 
 void *slab_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &slab_root_caches, pos);
+	return seq_list_next(p, &slab_caches, pos);
 }
 
 void slab_stop(struct seq_file *m, void *p)
@@ -1148,11 +1126,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 
 static int slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
 
-	if (p == slab_root_caches.next)
+	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	cache_show(s, m);
+	if (is_root_cache(s))
+		cache_show(s, m);
 	return 0;
 }
 
@@ -1254,7 +1233,7 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	mutex_lock(&slab_mutex);
 	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry(s, &slab_caches, list) {
 		/*
 		 * Skip kmem caches that don't have the memcg cache.
 		 */
diff --git a/mm/slub.c b/mm/slub.c
index 6761e40e2c2e..891ae2716df1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4292,7 +4292,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	return s;
 }
 
-- 
2.25.4


* [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (14 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-22 17:32   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

memcg_accumulate_slabinfo() is never called with a non-root
kmem_cache as a first argument, so the is_root_cache(s) check
is redundant and can be removed without any functional change.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab_common.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index c045afb9724e..52164ad0f197 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1087,9 +1087,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	struct kmem_cache *c;
 	struct slabinfo sinfo;
 
-	if (!is_root_cache(s))
-		return;
-
 	c = memcg_cache(s);
 	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
-- 
2.25.4



* [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (15 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17 23:35   ` Andrew Morton
  2020-06-22 19:21   ` Shakeel Butt
  2020-06-08 23:06 ` [PATCH v6 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
                   ` (3 subsequent siblings)
  20 siblings, 2 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Instead of having two sets of kmem_caches, one for system-wide and
non-accounted allocations and a second one shared by all accounted
allocations, we can use just one.

The idea is simple: space for obj_cgroup metadata can be allocated
on demand and filled only for accounted allocations.
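
The following is a minimal userspace sketch of that idea, not the kernel
code. All names in it (objcg_model, page_model, model_post_alloc(),
model_free(), model_demo()) are hypothetical stand-ins, and for brevity
the charge is folded into the post-alloc step, whereas the patch charges
in the pre-alloc hook and uncharges if the vector allocation fails.

```c
#include <stdlib.h>

#define OBJS_PER_PAGE 8

/* Hypothetical stand-in for struct obj_cgroup: just a byte counter. */
struct objcg_model {
	long charged;
};

/* Hypothetical stand-in for a slab page shared by all allocations. */
struct page_model {
	struct objcg_model **obj_cgroups; /* NULL until first accounted object */
};

/* Record the owner of an accounted object; allocate the vector lazily. */
static int model_post_alloc(struct page_model *page, unsigned int idx,
			    struct objcg_model *objcg, long size)
{
	if (!objcg)
		return 0;	/* non-accounted allocation: nothing to record */

	if (!page->obj_cgroups) {
		page->obj_cgroups = calloc(OBJS_PER_PAGE,
					   sizeof(*page->obj_cgroups));
		if (!page->obj_cgroups)
			return -1;	/* object stays unaccounted */
	}

	objcg->charged += size;
	page->obj_cgroups[idx] = objcg;
	return 0;
}

/* Uncharge on free, but only if the slot was filled for this object. */
static void model_free(struct page_model *page, unsigned int idx, long size)
{
	struct objcg_model *objcg;

	if (!page->obj_cgroups)
		return;		/* page never held an accounted object */

	objcg = page->obj_cgroups[idx];
	page->obj_cgroups[idx] = NULL;
	if (!objcg)
		return;		/* this particular object was not accounted */

	objcg->charged -= size;
}

/* Returns 0 if the model behaves as described, non-zero otherwise. */
int model_demo(void)
{
	struct page_model page = { .obj_cgroups = NULL };
	struct objcg_model cg = { .charged = 0 };

	/* Non-accounted allocation: the vector is never allocated. */
	if (model_post_alloc(&page, 0, NULL, 64) || page.obj_cgroups)
		return 1;

	/* First accounted allocation allocates the vector on demand. */
	if (model_post_alloc(&page, 1, &cg, 64) || !page.obj_cgroups)
		return 2;
	if (cg.charged != 64 || page.obj_cgroups[0] ||
	    page.obj_cgroups[1] != &cg)
		return 3;

	/* Free both objects: only the accounted one is uncharged. */
	model_free(&page, 0, 64);
	model_free(&page, 1, 64);
	if (cg.charged != 0 || page.obj_cgroups[1])
		return 4;

	free(page.obj_cgroups);
	return 0;
}
```

The two NULL checks in model_free() correspond to the page_has_obj_cgroups()
and !objcg checks that the diff below adds to memcg_slab_free_hook().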

This allows removing a bunch of code that was required to handle
kmem_cache clones for accounted allocations. There is no longer any
need to create them, accumulate statistics, propagate attributes, etc.
It's quite a significant simplification.

Also, because the total number of slab_caches is almost halved
(not all kmem_caches had a memcg clone), some additional memory
savings are expected. On my devvm it saves about an additional 3.5%
of slab memory.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h     |   2 -
 include/linux/slab_def.h |   3 -
 include/linux/slub_def.h |  10 --
 mm/memcontrol.c          |   5 +-
 mm/slab.c                |  41 +------
 mm/slab.h                | 186 +++++++------------------------
 mm/slab_common.c         | 230 +--------------------------------------
 mm/slub.c                | 163 +--------------------------
 8 files changed, 57 insertions(+), 583 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 310768bfa8d2..694a4f69e146 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,6 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct kmem_cache *cachep);
-
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index ccda7b9669a5..9eb430c163c2 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -72,9 +72,6 @@ struct kmem_cache {
 	int obj_offset;
 #endif /* CONFIG_DEBUG_SLAB */
 
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-#endif
 #ifdef CONFIG_KASAN
 	struct kasan_cache kasan_info;
 #endif
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index f87302dcfe8c..1be0ed5befa1 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -108,17 +108,7 @@ struct kmem_cache {
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
-	struct work_struct kobj_remove_work;
 #endif
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-	/* For propagation, maximum size of a stored attr */
-	unsigned int max_attr_size;
-#ifdef CONFIG_SYSFS
-	struct kset *memcg_kset;
-#endif
-#endif
-
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	unsigned long random;
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09a84326ead1..93b2e73ef2f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2826,7 +2826,10 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 		off = obj_to_index(page->slab_cache, page, p);
 		objcg = page_obj_cgroups(page)[off];
-		return obj_cgroup_memcg(objcg);
+		if (objcg)
+			return obj_cgroup_memcg(objcg);
+
+		return NULL;
 	}
 
 	/* All other pages use page->mem_cgroup */
diff --git a/mm/slab.c b/mm/slab.c
index 18a782bacd1b..7d33ab503290 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1369,11 +1369,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
-		__free_pages(page, cachep->gfporder);
-		return NULL;
-	}
-
+	charge_slab_page(page, flags, cachep->gfporder, cachep);
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -3788,8 +3784,8 @@ static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
 }
 
 /* Always called with the slab_mutex held */
-static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
-				int batchcount, int shared, gfp_t gfp)
+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+			    int batchcount, int shared, gfp_t gfp)
 {
 	struct array_cache __percpu *cpu_cache, *prev;
 	int cpu;
@@ -3834,30 +3830,6 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	return setup_kmem_cache_nodes(cachep, gfp);
 }
 
-static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
-				int batchcount, int shared, gfp_t gfp)
-{
-	int ret;
-	struct kmem_cache *c;
-
-	ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
-	if (slab_state < FULL)
-		return ret;
-
-	if ((ret < 0) || !is_root_cache(cachep))
-		return ret;
-
-	lockdep_assert_held(&slab_mutex);
-	c = memcg_cache(cachep);
-	if (c) {
-		/* return value determined by the root cache only */
-		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
-	}
-
-	return ret;
-}
-
 /* Called with slab_mutex held always */
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
 {
@@ -3870,13 +3842,6 @@ static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
 	if (err)
 		goto end;
 
-	if (!is_root_cache(cachep)) {
-		struct kmem_cache *root = memcg_root_cache(cachep);
-		limit = root->limit;
-		shared = root->shared;
-		batchcount = root->batchcount;
-	}
-
 	if (limit && shared && batchcount)
 		goto skip_setup;
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index c49a863adb63..a23518030862 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -30,28 +30,6 @@ struct kmem_cache {
 	struct list_head list;	/* List of all slab caches on the system */
 };
 
-#else /* !CONFIG_SLOB */
-
-/*
- * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child cache will have it. Some fields are used
- * in both cases, other are specific to root caches.
- *
- * @root_cache:	Common to root and child caches.  NULL for root, pointer to
- *		the root cache for children.
- *
- * The following fields are specific to root caches.
- *
- * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
- *		cgroups.
- * @work: work struct used to create the non-root cache.
- */
-struct memcg_cache_params {
-	struct kmem_cache *root_cache;
-
-	struct kmem_cache *memcg_cache;
-	struct work_struct work;
-};
 #endif /* CONFIG_SLOB */
 
 #ifdef CONFIG_SLAB
@@ -194,7 +172,6 @@ int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
-void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
@@ -233,43 +210,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return !s->memcg_params.root_cache;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return p == s || p == s->memcg_params.root_cache;
-}
-
-/*
- * We use suffixes to the name in memcg because we can't have caches
- * created in the system with the same name. But when we print them
- * locally, better refer to them with the base name
- */
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	if (!is_root_cache(s))
-		s = s->memcg_params.root_cache;
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s;
-	return s->memcg_params.root_cache;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s->memcg_params.memcg_cache;
-	return NULL;
-}
-
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -316,38 +256,25 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
-	struct kmem_cache *cachep;
 	struct obj_cgroup *objcg;
 
 	if (memcg_kmem_bypass())
-		return s;
-
-	cachep = READ_ONCE(s->memcg_params.memcg_cache);
-	if (unlikely(!cachep)) {
-		/*
-		 * If memcg cache does not exist yet, we schedule it's
-		 * asynchronous creation and let the current allocation
-		 * go through with the root cache.
-		 */
-		queue_work(system_wq, &s->memcg_params.work);
-		return s;
-	}
+		return NULL;
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
-		return s;
+		return NULL;
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		cachep = NULL;
+		return NULL;
 	}
 
-	*objcgp = objcg;
-	return cachep;
+	return objcg;
 }
 
 static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -366,15 +293,27 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 	struct page *page;
 	unsigned long off;
 	size_t i;
 
+	if (!objcg)
+		return;
+
+	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
+
+			if (!page_has_obj_cgroups(page) &&
+			    memcg_alloc_page_obj_cgroups(page, s, flags)) {
+				obj_cgroup_uncharge(objcg, obj_full_size(s));
+				continue;
+			}
+
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
@@ -393,13 +332,19 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	struct obj_cgroup *objcg;
 	unsigned int off;
 
-	if (!memcg_kmem_enabled() || is_root_cache(s))
+	if (!memcg_kmem_enabled())
+		return;
+
+	if (!page_has_obj_cgroups(page))
 		return;
 
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
 
+	if (!objcg)
+		return;
+
 	obj_cgroup_uncharge(objcg, obj_full_size(s));
 	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
 			-obj_full_size(s));
@@ -407,35 +352,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	obj_cgroup_put(objcg);
 }
 
-extern void slab_init_memcg_params(struct kmem_cache *);
-
 #else /* CONFIG_MEMCG_KMEM */
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return true;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return s == p;
-}
-
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	return s;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	return NULL;
-}
-
 static inline bool page_has_obj_cgroups(struct page *page)
 {
 	return false;
@@ -456,16 +373,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
 	return NULL;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 }
 
@@ -473,11 +391,6 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 					void *p)
 {
 }
-
-static inline void slab_init_memcg_params(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
@@ -491,27 +404,18 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 	return page->slab_cache;
 }
 
-static __always_inline int charge_slab_page(struct page *page,
-					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
+static __always_inline void charge_slab_page(struct page *page,
+					     gfp_t gfp, int order,
+					     struct kmem_cache *s)
 {
-	if (memcg_kmem_enabled() && !is_root_cache(s)) {
-		int ret;
-
-		ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
-		if (ret)
-			return ret;
-	}
-
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
-	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (memcg_kmem_enabled() && !is_root_cache(s))
+	if (memcg_kmem_enabled())
 		memcg_free_page_obj_cgroups(page);
 
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
@@ -522,20 +426,12 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 {
 	struct kmem_cache *cachep;
 
-	/*
-	 * When kmemcg is not being used, both assignments should return the
-	 * same value. but we don't want to pay the assignment price in that
-	 * case. If it is not compiled in, the compiler should be smart enough
-	 * to not do even the assignment. In that case, slab_equal_or_root
-	 * will also be a constant.
-	 */
-	if (!memcg_kmem_enabled() &&
-	    !IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
+	if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
 	    !unlikely(s->flags & SLAB_CONSISTENCY_CHECKS))
 		return s;
 
 	cachep = virt_to_cache(x);
-	WARN_ONCE(cachep && !slab_equal_or_root(cachep, s),
+	WARN_ONCE(cachep && cachep != s,
 		  "%s: Wrong slab cache. %s but object is from %s\n",
 		  __func__, s->name, cachep->name);
 	return cachep;
@@ -587,7 +483,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
+		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
 
 	return s;
 }
@@ -606,8 +502,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled() && !is_root_cache(s))
-		memcg_slab_post_alloc_hook(s, objcg, size, p);
+	if (memcg_kmem_enabled())
+		memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 52164ad0f197..7be382d45514 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -128,36 +128,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 	return i;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-static void memcg_kmem_cache_create_func(struct work_struct *work)
-{
-	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
-						 memcg_params.work);
-	memcg_create_kmem_cache(cachep);
-}
-
-void slab_init_memcg_params(struct kmem_cache *s)
-{
-	s->memcg_params.root_cache = NULL;
-	s->memcg_params.memcg_cache = NULL;
-	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
-}
-
-static void init_memcg_params(struct kmem_cache *s,
-			      struct kmem_cache *root_cache)
-{
-	if (root_cache)
-		s->memcg_params.root_cache = root_cache;
-	else
-		slab_init_memcg_params(s);
-}
-#else
-static inline void init_memcg_params(struct kmem_cache *s,
-				     struct kmem_cache *root_cache)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 /*
  * Figure out what the alignment of the objects will be given a set of
  * flags, a user specified alignment and the size of the objects.
@@ -195,9 +165,6 @@ int slab_unmergeable(struct kmem_cache *s)
 	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
 		return 1;
 
-	if (!is_root_cache(s))
-		return 1;
-
 	if (s->ctor)
 		return 1;
 
@@ -284,7 +251,6 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
@@ -342,7 +308,6 @@ kmem_cache_create_usercopy(const char *name,
 
 	get_online_cpus();
 	get_online_mems();
-	memcg_get_cache_ids();
 
 	mutex_lock(&slab_mutex);
 
@@ -392,7 +357,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
 	mutex_unlock(&slab_mutex);
 
-	memcg_put_cache_ids();
 	put_online_mems();
 	put_online_cpus();
 
@@ -505,87 +469,6 @@ static int shutdown_cache(struct kmem_cache *s)
 	return 0;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-/*
- * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
- * @root_cache: The parent of the new cache.
- *
- * This function attempts to create a kmem cache that will serve allocation
- * requests going all non-root memory cgroups to @root_cache. The new cache
- * inherits properties from its parent.
- */
-void memcg_create_kmem_cache(struct kmem_cache *root_cache)
-{
-	struct kmem_cache *s = NULL;
-	char *cache_name;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-
-	if (root_cache->memcg_params.memcg_cache)
-		goto out_unlock;
-
-	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
-	if (!cache_name)
-		goto out_unlock;
-
-	s = create_cache(cache_name, root_cache->object_size,
-			 root_cache->align,
-			 root_cache->flags & CACHE_CREATE_MASK,
-			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, root_cache);
-	/*
-	 * If we could not create a memcg cache, do not complain, because
-	 * that's not critical at all as we can always proceed with the root
-	 * cache.
-	 */
-	if (IS_ERR(s)) {
-		kfree(cache_name);
-		goto out_unlock;
-	}
-
-	/*
-	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
-	 * barrier here to ensure nobody will see the kmem_cache partially
-	 * initialized.
-	 */
-	smp_wmb();
-	root_cache->memcg_params.memcg_cache = s;
-
-out_unlock:
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	BUG_ON(!is_root_cache(s));
-
-	if (s->memcg_params.memcg_cache)
-		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
-
-	return 0;
-}
-
-static void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-	cancel_work_sync(&s->memcg_params.work);
-}
-#else
-static inline int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
@@ -600,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;
 
-	cancel_memcg_cache_creation(s);
-
 	get_online_cpus();
 	get_online_mems();
 
@@ -611,10 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	err = shutdown_memcg_caches(s);
-	if (!err)
-		err = shutdown_cache(s);
-
+	err = shutdown_cache(s);
 	if (err) {
 		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
 		       s->name);
@@ -651,33 +529,6 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-/**
- * kmem_cache_shrink_all - shrink root and memcg caches
- * @s: The cache pointer
- */
-void kmem_cache_shrink_all(struct kmem_cache *s)
-{
-	struct kmem_cache *c;
-
-	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
-		kmem_cache_shrink(s);
-		return;
-	}
-
-	get_online_cpus();
-	get_online_mems();
-	kasan_cache_shrink(s);
-	__kmem_cache_shrink(s);
-
-	c = memcg_cache(s);
-	if (c) {
-		kasan_cache_shrink(c);
-		__kmem_cache_shrink(c);
-	}
-	put_online_mems();
-	put_online_cpus();
-}
-
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
@@ -706,8 +557,6 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	slab_init_memcg_params(s);
-
 	err = __kmem_cache_create(s, flags);
 
 	if (err)
@@ -1081,25 +930,6 @@ void slab_stop(struct seq_file *m, void *p)
 	mutex_unlock(&slab_mutex);
 }
 
-static void
-memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
-{
-	struct kmem_cache *c;
-	struct slabinfo sinfo;
-
-	c = memcg_cache(s);
-	if (c) {
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-
-		info->active_slabs += sinfo.active_slabs;
-		info->num_slabs += sinfo.num_slabs;
-		info->shared_avail += sinfo.shared_avail;
-		info->active_objs += sinfo.active_objs;
-		info->num_objs += sinfo.num_objs;
-	}
-}
-
 static void cache_show(struct kmem_cache *s, struct seq_file *m)
 {
 	struct slabinfo sinfo;
@@ -1107,10 +937,8 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 	memset(&sinfo, 0, sizeof(sinfo));
 	get_slabinfo(s, &sinfo);
 
-	memcg_accumulate_slabinfo(s, &sinfo);
-
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
-		   cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
+		   s->name, sinfo.active_objs, sinfo.num_objs, s->size,
 		   sinfo.objects_per_slab, (1 << sinfo.cache_order));
 
 	seq_printf(m, " : tunables %4u %4u %4u",
@@ -1127,8 +955,7 @@ static int slab_show(struct seq_file *m, void *p)
 
 	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	if (is_root_cache(s))
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 
@@ -1153,13 +980,13 @@ void dump_unreclaimable_slab(void)
 	pr_info("Name                      Used          Total\n");
 
 	list_for_each_entry_safe(s, s2, &slab_caches, list) {
-		if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+		if (s->flags & SLAB_RECLAIM_ACCOUNT)
 			continue;
 
 		get_slabinfo(s, &sinfo);
 
 		if (sinfo.num_objs > 0)
-			pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
+			pr_info("%-17s %10luKB %10luKB\n", s->name,
 				(sinfo.active_objs * s->size) / 1024,
 				(sinfo.num_objs * s->size) / 1024);
 	}
@@ -1218,53 +1045,6 @@ static int __init slab_proc_init(void)
 }
 module_init(slab_proc_init);
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
-/*
- * Display information about kmem caches that have memcg cache.
- */
-static int memcg_slabinfo_show(struct seq_file *m, void *unused)
-{
-	struct kmem_cache *s, *c;
-	struct slabinfo sinfo;
-
-	mutex_lock(&slab_mutex);
-	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
-	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_caches, list) {
-		/*
-		 * Skip kmem caches that don't have the memcg cache.
-		 */
-		if (!s->memcg_params.memcg_cache)
-			continue;
-
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(s, &sinfo);
-		seq_printf(m, "%-17s root       %6lu %6lu %6lu %6lu\n",
-			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-
-		c = s->memcg_params.memcg_cache;
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
-			   cache_name(c), root_mem_cgroup->css.id,
-			   sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-	}
-	mutex_unlock(&slab_mutex);
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo);
-
-static int __init memcg_slabinfo_init(void)
-{
-	debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO,
-			    NULL, NULL, &memcg_slabinfo_fops);
-	return 0;
-}
-
-late_initcall(memcg_slabinfo_init);
-#endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */
 #endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */
 
 static __always_inline void *__do_krealloc(const void *p, size_t new_size,
diff --git a/mm/slub.c b/mm/slub.c
index 891ae2716df1..3d1a93edfee3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -214,14 +214,10 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
 #ifdef CONFIG_SYSFS
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
-static void memcg_propagate_slab_attrs(struct kmem_cache *s);
-static void sysfs_slab_remove(struct kmem_cache *s);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
-static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
-static inline void sysfs_slab_remove(struct kmem_cache *s) { }
 #endif
 
 static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -1540,10 +1536,8 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
-		__free_pages(page, order);
-		page = NULL;
-	}
+	if (page)
+		charge_slab_page(page, flags, order, s);
 
 	return page;
 }
@@ -3852,7 +3846,6 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 		if (n->nr_partial || slabs_node(s, node))
 			return 1;
 	}
-	sysfs_slab_remove(s);
 	return 0;
 }
 
@@ -4290,7 +4283,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 			p->slab_cache = s;
 #endif
 	}
-	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
 	return s;
 }
@@ -4346,7 +4338,7 @@ struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		   slab_flags_t flags, void (*ctor)(void *))
 {
-	struct kmem_cache *s, *c;
+	struct kmem_cache *s;
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
@@ -4359,12 +4351,6 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		c = memcg_cache(s);
-		if (c) {
-			c->object_size = s->object_size;
-			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
-		}
-
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
 			s = NULL;
@@ -4386,7 +4372,6 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	if (slab_state <= UP)
 		return 0;
 
-	memcg_propagate_slab_attrs(s);
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
@@ -5366,7 +5351,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink_all(s);
+		kmem_cache_shrink(s);
 	else
 		return -EINVAL;
 	return length;
@@ -5590,99 +5575,9 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		return -EIO;
 
 	err = attribute->store(s, buf, len);
-#ifdef CONFIG_MEMCG
-	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		struct kmem_cache *c;
-
-		mutex_lock(&slab_mutex);
-		if (s->max_attr_size < len)
-			s->max_attr_size = len;
-
-		/*
-		 * This is a best effort propagation, so this function's return
-		 * value will be determined by the parent cache only. This is
-		 * basically because not all attributes will have a well
-		 * defined semantics for rollbacks - most of the actions will
-		 * have permanent effects.
-		 *
-		 * Returning the error value of any of the children that fail
-		 * is not 100 % defined, in the sense that users seeing the
-		 * error code won't be able to know anything about the state of
-		 * the cache.
-		 *
-		 * Only returning the error code for the parent cache at least
-		 * has well defined semantics. The cache being written to
-		 * directly either failed or succeeded, in which case we loop
-		 * through the descendants with best-effort propagation.
-		 */
-		c = memcg_cache(s);
-		if (c)
-			attribute->store(c, buf, len);
-		mutex_unlock(&slab_mutex);
-	}
-#endif
 	return err;
 }
 
-static void memcg_propagate_slab_attrs(struct kmem_cache *s)
-{
-#ifdef CONFIG_MEMCG
-	int i;
-	char *buffer = NULL;
-	struct kmem_cache *root_cache;
-
-	if (is_root_cache(s))
-		return;
-
-	root_cache = s->memcg_params.root_cache;
-
-	/*
-	 * This mean this cache had no attribute written. Therefore, no point
-	 * in copying default values around
-	 */
-	if (!root_cache->max_attr_size)
-		return;
-
-	for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
-		char mbuf[64];
-		char *buf;
-		struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
-		ssize_t len;
-
-		if (!attr || !attr->store || !attr->show)
-			continue;
-
-		/*
-		 * It is really bad that we have to allocate here, so we will
-		 * do it only as a fallback. If we actually allocate, though,
-		 * we can just use the allocated buffer until the end.
-		 *
-		 * Most of the slub attributes will tend to be very small in
-		 * size, but sysfs allows buffers up to a page, so they can
-		 * theoretically happen.
-		 */
-		if (buffer)
-			buf = buffer;
-		else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
-			 !IS_ENABLED(CONFIG_SLUB_STATS))
-			buf = mbuf;
-		else {
-			buffer = (char *) get_zeroed_page(GFP_KERNEL);
-			if (WARN_ON(!buffer))
-				continue;
-			buf = buffer;
-		}
-
-		len = attr->show(root_cache, buf);
-		if (len > 0)
-			attr->store(s, buf, len);
-	}
-
-	if (buffer)
-		free_page((unsigned long)buffer);
-#endif	/* CONFIG_MEMCG */
-}
-
 static void kmem_cache_release(struct kobject *k)
 {
 	slab_kmem_cache_release(to_slab(k));
@@ -5702,10 +5597,6 @@ static struct kset *slab_kset;
 
 static inline struct kset *cache_kset(struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG
-	if (!is_root_cache(s))
-		return s->memcg_params.root_cache->memcg_kset;
-#endif
 	return slab_kset;
 }
 
@@ -5748,27 +5639,6 @@ static char *create_unique_id(struct kmem_cache *s)
 	return name;
 }
 
-static void sysfs_slab_remove_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s =
-		container_of(work, struct kmem_cache, kobj_remove_work);
-
-	if (!s->kobj.state_in_sysfs)
-		/*
-		 * For a memcg cache, this may be called during
-		 * deactivation and again on shutdown.  Remove only once.
-		 * A cache is never shut down before deactivation is
-		 * complete, so no need to worry about synchronization.
-		 */
-		goto out;
-
-#ifdef CONFIG_MEMCG
-	kset_unregister(s->memcg_kset);
-#endif
-out:
-	kobject_put(&s->kobj);
-}
-
 static int sysfs_slab_add(struct kmem_cache *s)
 {
 	int err;
@@ -5776,8 +5646,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	struct kset *kset = cache_kset(s);
 	int unmergeable = slab_unmergeable(s);
 
-	INIT_WORK(&s->kobj_remove_work, sysfs_slab_remove_workfn);
-
 	if (!kset) {
 		kobject_init(&s->kobj, &slab_ktype);
 		return 0;
@@ -5814,16 +5682,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	if (err)
 		goto out_del_kobj;
 
-#ifdef CONFIG_MEMCG
-	if (is_root_cache(s) && memcg_sysfs_enabled) {
-		s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);
-		if (!s->memcg_kset) {
-			err = -ENOMEM;
-			goto out_del_kobj;
-		}
-	}
-#endif
-
 	if (!unmergeable) {
 		/* Setup first alias */
 		sysfs_slab_alias(s, s->name);
@@ -5837,19 +5695,6 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	goto out;
 }
 
-static void sysfs_slab_remove(struct kmem_cache *s)
-{
-	if (slab_state < FULL)
-		/*
-		 * Sysfs has not been setup yet so no need to remove the
-		 * cache from sysfs.
-		 */
-		return;
-
-	kobject_get(&s->kobj);
-	schedule_work(&s->kobj_remove_work);
-}
-
 void sysfs_slab_unlink(struct kmem_cache *s)
 {
 	if (slab_state >= FULL)
-- 
2.25.4


* [PATCH v6 18/19] kselftests: cgroup: add kernel memory accounting tests
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (16 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
@ 2020-06-08 23:06 ` Roman Gushchin
  2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-08 23:06 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	Vlastimil Babka, kernel-team, linux-kernel, Roman Gushchin

Add some tests to cover the kernel memory accounting functionality.
They cover some issues (and changes) we ran into recently.

1) A test which allocates a lot of negative dentries, checks memcg
slab statistics, creates memory pressure by setting memory.high
to a low value and checks that at least half of the slab memory
was reclaimed.

2) A test which covers side effects of memcg destruction: it creates
and destroys a large number of sub-cgroups, each containing a
multi-threaded workload which allocates and releases some kernel
memory. Then it checks that the charge and memory.stat add up
on the parent level.

3) A test which reads /proc/kpagecgroup and implicitly checks that it
doesn't crash the system.

4) A test which spawns a large number of threads and checks that
the kernel stacks accounting works as expected.

5) A test which checks that live charged slab objects do not
prevent the memory cgroup from being released after being deleted
by a user.
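
As a rough illustration of the checks in 1) and 2), here is a standalone
sketch, not the selftest code itself. The helper names stat_read_key() and
charge_adds_up() are made up for this example, the input is a plain string
in memory.stat format rather than a real cgroup file, and a 4096-byte page
size is assumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find "key" at the start of a line in
 * memory.stat-style text and return the number that follows it.
 * Returns -1 if the key is missing. The key must include its trailing
 * space (e.g. "slab ") so that it does not match "slab_reclaimable".
 */
static long stat_read_key(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	assert(text && key);

	while (line && *line) {
		if (!strncmp(line, key, klen))
			return strtol(line + klen, NULL, 10);
		/* Skip to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical check mirroring test 2): slab + anon + file + kernel_stack
 * should match the total charge, within the same slack the test uses,
 * 4096 * 32 * 2 * nr_cpus bytes. Returns 1 if the numbers add up.
 */
static int charge_adds_up(const char *stat, long current, int nr_cpus)
{
	long slab = stat_read_key(stat, "slab ");
	long anon = stat_read_key(stat, "anon ");
	long file = stat_read_key(stat, "file ");
	long kernel_stack = stat_read_key(stat, "kernel_stack ");
	long sum;

	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
	    kernel_stack < 0)
		return 0;

	sum = slab + anon + file + kernel_stack;
	/* labs(), not abs(): the operands are long. */
	return labs(sum - current) < 4096L * 32 * 2 * nr_cpus;
}
```

The trailing space in keys such as "slab " matters: without it, a prefix
match could pick up a longer counter name on an earlier line.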

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 382 +++++++++++++++++++++
 3 files changed, 385 insertions(+)
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index aa6de65b0838..84cfcabea838 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -2,3 +2,4 @@
 test_memcontrol
 test_core
 test_freezer
+test_kmem
\ No newline at end of file
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 967f268fde74..f027d933595b 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -6,11 +6,13 @@ all:
 TEST_FILES     := with_stress.sh
 TEST_PROGS     := test_stress.sh
 TEST_GEN_PROGS = test_memcontrol
+TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
+$(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
 $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
new file mode 100644
index 000000000000..5224dae216e5
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <linux/limits.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <sys/sysinfo.h>
+#include <pthread.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+
+static int alloc_dcache(const char *cgroup, void *arg)
+{
+	unsigned long i;
+	struct stat st;
+	char buf[128];
+
+	for (i = 0; i < (unsigned long)arg; i++) {
+		snprintf(buf, sizeof(buf),
+			"/something-non-existent-with-a-long-name-%64lu-%d",
+			 i, getpid());
+		stat(buf, &st);
+	}
+
+	return 0;
+}
+
+/*
+ * This test allocates 100000 negative dentries with long names.
+ * Then it checks that "slab" in memory.stat is larger than 1M.
+ * Then it sets memory.high to 1M and checks that at least 1/2
+ * of slab memory has been reclaimed.
+ */
+static int test_kmem_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+	long slab0, slab1, current;
+
+	cg = cg_name(root, "kmem_basic_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, alloc_dcache, (void *)100000))
+		goto cleanup;
+
+	slab0 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab0 < (1 << 20))
+		goto cleanup;
+
+	cg_write(cg, "memory.high", "1M");
+	slab1 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab1 <= 0)
+		goto cleanup;
+
+	current = cg_read_long(cg, "memory.current");
+	if (current <= 0)
+		goto cleanup;
+
+	if (slab1 < slab0 / 2 && current < slab0 / 2)
+		ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+static void *alloc_kmem_fn(void *arg)
+{
+	alloc_dcache(NULL, (void *)100);
+	return NULL;
+}
+
+static int alloc_kmem_smp(const char *cgroup, void *arg)
+{
+	int nr_threads = 2 * get_nprocs();
+	pthread_t *tinfo;
+	unsigned long i;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &alloc_kmem_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return -1;
+		}
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		ret = pthread_join(tinfo[i], NULL);
+		if (ret)
+			break;
+	}
+
+	free(tinfo);
+	return ret;
+}
+
+static int cg_run_in_subcgroups(const char *parent,
+				int (*fn)(const char *cgroup, void *arg),
+				void *arg, int times)
+{
+	char *child;
+	int i;
+
+	for (i = 0; i < times; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		if (cg_run(child, fn, arg)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		cg_destroy(child);
+		free(child);
+	}
+
+	return 0;
+}
+
+/*
+ * The test creates and destroys a large number of cgroups. In each cgroup it
+ * allocates some slab memory (mostly negative dentries) using 2 * NR_CPUS
+ * threads. Then it checks the sanity of numbers on the parent level:
+ * the total size of the cgroups should be roughly equal to
+ * anon + file + slab + kernel_stack.
+ */
+static int test_kmem_memcg_deletion(const char *root)
+{
+	long current, slab, anon, file, kernel_stack, sum;
+	int ret = KSFT_FAIL;
+	char *parent;
+
+	parent = cg_name(root, "kmem_memcg_deletion_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_kmem_smp, NULL, 100))
+		goto cleanup;
+
+	current = cg_read_long(parent, "memory.current");
+	slab = cg_read_key_long(parent, "memory.stat", "slab ");
+	anon = cg_read_key_long(parent, "memory.stat", "anon ");
+	file = cg_read_key_long(parent, "memory.stat", "file ");
+	kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack ");
+	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
+	    kernel_stack < 0)
+		goto cleanup;
+
+	sum = slab + anon + file + kernel_stack;
+	if (labs(sum - current) < 4096 * 32 * 2 * get_nprocs()) {
+		ret = KSFT_PASS;
+	} else {
+		printf("memory.current = %ld\n", current);
+		printf("slab + anon + file + kernel_stack = %ld\n", sum);
+		printf("slab = %ld\n", slab);
+		printf("anon = %ld\n", anon);
+		printf("file = %ld\n", file);
+		printf("kernel_stack = %ld\n", kernel_stack);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+/*
+ * The test reads the entire /proc/kpagecgroup. If the operation went
+ * successfully (and the kernel didn't panic), the test is treated as passed.
+ */
+static int test_kmem_proc_kpagecgroup(const char *root)
+{
+	unsigned long buf[128];
+	int ret = KSFT_FAIL;
+	ssize_t len;
+	int fd;
+
+	fd = open("/proc/kpagecgroup", O_RDONLY);
+	if (fd < 0)
+		return ret;
+
+	do {
+		len = read(fd, buf, sizeof(buf));
+	} while (len > 0);
+
+	if (len == 0)
+		ret = KSFT_PASS;
+
+	close(fd);
+	return ret;
+}
+
+static void *pthread_wait_fn(void *arg)
+{
+	sleep(100);
+	return NULL;
+}
+
+static int spawn_1000_threads(const char *cgroup, void *arg)
+{
+	int nr_threads = 1000;
+	pthread_t *tinfo;
+	unsigned long i;
+	long stack;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &pthread_wait_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return -1;
+		}
+	}
+
+	stack = cg_read_key_long(cgroup, "memory.stat", "kernel_stack ");
+	if (stack >= 4096 * 1000)
+		ret = 0;
+
+	free(tinfo);
+	return ret;
+}
+
+/*
+ * The test spawns a process, which spawns 1000 threads. Then it checks
+ * that memory.stat's kernel_stack is at least 1000 pages large.
+ */
+static int test_kmem_kernel_stacks(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+
+	cg = cg_name(root, "kmem_kernel_stacks_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, spawn_1000_threads, NULL))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+/*
+ * This test sequentially creates 30 child cgroups, allocates some
+ * kernel memory in each of them, and deletes them. Then it checks
+ * that the number of dying cgroups on the parent level is 0.
+ */
+static int test_kmem_dead_cgroups(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent;
+	long dead;
+	int i;
+
+	parent = cg_name(root, "kmem_dead_cgroups_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30))
+		goto cleanup;
+
+	for (i = 0; i < 5; i++) {
+		dead = cg_read_key_long(parent, "cgroup.stat",
+					"nr_dying_descendants ");
+		if (dead == 0) {
+			ret = KSFT_PASS;
+			break;
+		}
+		/*
+		 * Reclaiming cgroups might take some time,
+		 * let's wait a bit and repeat.
+		 */
+		sleep(1);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct kmem_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_kmem_basic),
+	T(test_kmem_memcg_deletion),
+	T(test_kmem_proc_kpagecgroup),
+	T(test_kmem_kernel_stacks),
+	T(test_kmem_dead_cgroups),
+};
+#undef T
+
+int main(int argc, char **argv)
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	/*
+	 * Check that memory controller is available:
+	 * memory is listed in cgroup.controllers
+	 */
+	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
+		ksft_exit_skip("memory controller isn't available\n");
+
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set memory controller\n");
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.25.4


* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (17 preceding siblings ...)
  2020-06-08 23:06 ` [PATCH v6 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
@ 2020-06-17  1:46 ` Shakeel Butt
  2020-06-17  2:41   ` Roman Gushchin
  2020-06-18  1:18   ` Roman Gushchin
  2020-06-18  9:27 ` Mike Rapoport
  2020-06-21 22:57 ` Qian Cai
  20 siblings, 2 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  1:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> This is v6 of the slab cgroup controller rework.
>
> The patchset moves the accounting from the page level to the object
> level. It allows to share slab pages between memory cgroups.
> This leads to a significant win in the slab utilization (up to 45%)
> and the corresponding drop in the total kernel memory footprint.

Is this based on just SLUB or does this have a similar impact on SLAB as well?

> The reduced number of unmovable slab pages should also have a positive
> effect on the memory fragmentation.

That would be awesome. We have seen fragmentation getting very bad on
system (or node) level memory pressure. Is that the same for you?

>
> The patchset makes the slab accounting code simpler: there is no more
> need in the complicated dynamic creation and destruction of per-cgroup
> slab caches, all memory cgroups use a global set of shared slab caches.
> The lifetime of slab caches is not more connected to the lifetime
> of memory cgroups.
>
> The more precise accounting does require more CPU, however in practice
> the difference seems to be negligible. We've been using the new slab
> controller in Facebook production for several months with different
> workloads and haven't seen any noticeable regressions. What we've seen
> were memory savings in order of 1 GB per host (it varied heavily depending
> on the actual workload, size of RAM, number of CPUs, memory pressure, etc).
>
> The third version of the patchset added yet another step towards
> the simplification of the code: sharing of slab caches between
> accounted and non-accounted allocations. It comes with significant
> upsides (most noticeable, a complete elimination of dynamic slab caches
> creation) but not without some regression risks, so this change sits
> on top of the patchset and is not completely merged in. So in the unlikely
> event of a noticeable performance regression it can be reverted separately.
>

Have you performed any [perf] testing on SLAB with this patchset?

> v6:
>   1) rebased on top of the mm tree
>   2) removed a redundant check from cache_from_obj(), suggested by Vlastimil
>
> v5:
>   1) fixed a build error, spotted by Vlastimil
>   2) added a comment about memcg->nr_charged_bytes, asked by Johannes
>   3) added missed acks and reviews
>
> v4:
>   1) rebased on top of the mm tree, some fixes here and there
>   2) merged obj_to_index() with slab_index(), suggested by Vlastimil
>   3) changed objects_per_slab() to a better objects_per_slab_page(),
>      suggested by Vlastimil
>   4) other minor fixes and changes
>
> v3:
>   1) added a patch that switches to a global single set of kmem_caches
>   2) kmem API clean up dropped, because if has been already merged
>   3) byte-sized slab vmstat API over page-sized global counters and
>      bytes-sized memcg/lruvec counters
>   3) obj_cgroup refcounting simplifications and other minor fixes
>   4) other minor changes
>
> v2:
>   1) implemented re-layering and renaming suggested by Johannes,
>      added his patch to the set. Thanks!
>   2) fixed the issue discovered by Bharata B Rao. Thanks!
>   3) added kmem API clean up part
>   4) added slab/memcg follow-up clean up part
>   5) fixed a couple of issues discovered by internal testing on FB fleet.
>   6) added kselftests
>   7) included metadata into the charge calculation
>   8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc
>
> v1:
>   1) fixed a bug in zoneinfo_show_print()
>   2) added some comments to the subpage charging API, a minor fix
>   3) separated memory.kmem.slabinfo deprecation into a separate patch,
>      provided a drgn-based replacement
>   4) rebased on top of the current mm tree
>
> RFC:
>   https://lwn.net/Articles/798605/
>
>
> Johannes Weiner (1):
>   mm: memcontrol: decouple reference counting from page accounting
>
> Roman Gushchin (18):
>   mm: memcg: factor out memcg- and lruvec-level changes out of
>     __mod_lruvec_state()
>   mm: memcg: prepare for byte-sized vmstat items
>   mm: memcg: convert vmstat slab counters to bytes
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcg/slab: obj_cgroup API
>   mm: memcg/slab: allocate obj_cgroups for non-root slab pages
>   mm: memcg/slab: save obj_cgroup for non-root slab objects
>   mm: memcg/slab: charge individual slab objects instead of pages
>   mm: memcg/slab: deprecate memory.kmem.slabinfo
>   mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
>   mm: memcg/slab: use a single set of kmem_caches for all accounted
>     allocations
>   mm: memcg/slab: simplify memcg cache creation
>   mm: memcg/slab: remove memcg_kmem_get_cache()
>   mm: memcg/slab: deprecate slab_root_caches
>   mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
>   mm: memcg/slab: use a single set of kmem_caches for all allocations
>   kselftests: cgroup: add kernel memory accounting tests
>   tools/cgroup: add memcg_slabinfo.py tool
>
>  drivers/base/node.c                        |   6 +-
>  fs/proc/meminfo.c                          |   4 +-
>  include/linux/memcontrol.h                 |  85 ++-
>  include/linux/mm_types.h                   |   5 +-
>  include/linux/mmzone.h                     |  24 +-
>  include/linux/slab.h                       |   5 -
>  include/linux/slab_def.h                   |   9 +-
>  include/linux/slub_def.h                   |  31 +-
>  include/linux/vmstat.h                     |  14 +-
>  kernel/power/snapshot.c                    |   2 +-
>  mm/memcontrol.c                            | 608 +++++++++++--------
>  mm/oom_kill.c                              |   2 +-
>  mm/page_alloc.c                            |   8 +-
>  mm/slab.c                                  |  70 +--
>  mm/slab.h                                  | 372 +++++-------
>  mm/slab_common.c                           | 643 +--------------------
>  mm/slob.c                                  |  12 +-
>  mm/slub.c                                  | 229 +-------
>  mm/vmscan.c                                |   3 +-
>  mm/vmstat.c                                |  30 +-
>  mm/workingset.c                            |   6 +-
>  tools/cgroup/memcg_slabinfo.py             | 226 ++++++++
>  tools/testing/selftests/cgroup/.gitignore  |   1 +
>  tools/testing/selftests/cgroup/Makefile    |   2 +
>  tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
>  25 files changed, 1374 insertions(+), 1405 deletions(-)
>  create mode 100755 tools/cgroup/memcg_slabinfo.py
>  create mode 100644 tools/testing/selftests/cgroup/test_kmem.c
>
> --
> 2.25.4
>

* Re: [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
@ 2020-06-17  1:52   ` Shakeel Butt
  2020-06-17  2:50     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  1:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> To convert memcg and lruvec slab counters to bytes there must be
> a way to change these counters without touching node counters.
> Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/memcontrol.h | 17 +++++++++++++++
>  mm/memcontrol.c            | 43 +++++++++++++++++++++-----------------
>  2 files changed, 41 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bbf624a7f5a6..93dbc7f9d8b8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>         return x;
>  }
>
> +void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> +                             int val);
>  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>                         int val);
>  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
>  void mod_memcg_obj_state(void *p, int idx, int val);
>
> +static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> +                                         enum node_stat_item idx, int val)

Is this function used in later patches? Any benefit introducing it
here instead of in the patch where it is used for the first time?

> +{
> +       unsigned long flags;
> +
> +       local_irq_save(flags);
> +       __mod_memcg_lruvec_state(lruvec, idx, val);
> +       local_irq_restore(flags);
> +}
> +
[...]

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
@ 2020-06-17  2:41   ` Roman Gushchin
  2020-06-17  3:05     ` Shakeel Butt
  2020-06-18  1:18   ` Roman Gushchin
  1 sibling, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-17  2:41 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > This is v6 of the slab cgroup controller rework.
> >
> > The patchset moves the accounting from the page level to the object
> > level. It allows to share slab pages between memory cgroups.
> > This leads to a significant win in the slab utilization (up to 45%)
> > and the corresponding drop in the total kernel memory footprint.
> 
> Is this based on just SLUB or does this have a similar impact on SLAB as well?

The numbers for SLAB are less impressive than for SLUB (I guess per-cpu partial
lists add to the problem), but they are still double-digit percentages.

> 
> > The reduced number of unmovable slab pages should also have a positive
> > effect on the memory fragmentation.
> 
> That would be awesome. We have seen fragmentation getting very bad on
> system (or node) level memory pressure. Is that the same for you?

Well, we didn't have any specific problems with the fragmentation,
but generally speaking reducing the size of unmovable memory by ~40%
should have a positive effect.

> 
> >
> > The patchset makes the slab accounting code simpler: there is no more
> > need in the complicated dynamic creation and destruction of per-cgroup
> > slab caches, all memory cgroups use a global set of shared slab caches.
> > The lifetime of slab caches is not more connected to the lifetime
> > of memory cgroups.
> >
> > The more precise accounting does require more CPU, however in practice
> > the difference seems to be negligible. We've been using the new slab
> > controller in Facebook production for several months with different
> > workloads and haven't seen any noticeable regressions. What we've seen
> > were memory savings in order of 1 GB per host (it varied heavily depending
> > on the actual workload, size of RAM, number of CPUs, memory pressure, etc).
> >
> > The third version of the patchset added yet another step towards
> > the simplification of the code: sharing of slab caches between
> > accounted and non-accounted allocations. It comes with significant
> > upsides (most noticeable, a complete elimination of dynamic slab caches
> > creation) but not without some regression risks, so this change sits
> > on top of the patchset and is not completely merged in. So in the unlikely
> > event of a noticeable performance regression it can be reverted separately.
> >
> 
> Have you performed any [perf] testing on SLAB with this patchset?

The accounting part is the same for SLAB and SLUB, so there should be no
significant difference. I've checked that it compiles, boots and passes
kselftests, and that the memory savings are there.

Thanks!

* Re: [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-06-17  1:52   ` Shakeel Butt
@ 2020-06-17  2:50     ` Roman Gushchin
  2020-06-17  2:59       ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-17  2:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 06:52:09PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > To convert memcg and lruvec slab counters to bytes there must be
> > a way to change these counters without touching node counters.
> > Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/memcontrol.h | 17 +++++++++++++++
> >  mm/memcontrol.c            | 43 +++++++++++++++++++++-----------------
> >  2 files changed, 41 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index bbf624a7f5a6..93dbc7f9d8b8 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >         return x;
> >  }
> >
> > +void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > +                             int val);
> >  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >                         int val);
> >  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
> >  void mod_memcg_obj_state(void *p, int idx, int val);
> >
> > +static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> > +                                         enum node_stat_item idx, int val)
> 
> Is this function used in later patches? Any benefit introducing it
> here instead of in the patch where it is used for the first time?

Yes, it's used in "mm: memcg/slab: charge individual slab objects instead of pages".

It's a fairly large patchset with many internal dependencies, so there is
always a trade-off between putting everything into a single patch, which is
hard to review, and splitting out some changes, which don't make much sense
without seeing the whole picture.

In this particular case splitting out a formal and easy-to-verify change makes
the actual non-trivial patch smaller and hopefully easier to review.

But of course it's all subjective.

Thanks!

* Re: [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
@ 2020-06-17  2:57   ` Shakeel Butt
  2020-06-17  3:19     ` Roman Gushchin
  2020-06-17 15:55   ` Shakeel Butt
  1 sibling, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  2:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> To implement per-object slab memory accounting, we need to
> convert slab vmstat counters to bytes. Actually, out of
> 4 levels of counters: global, per-node, per-memcg and per-lruvec
> only two last levels will require byte-sized counters.
> It's because global and per-node counters will be counting the
> number of slab pages, and per-memcg and per-lruvec will be
> counting the amount of memory taken by charged slab objects.
>
> Converting all vmstat counters to bytes or even all slab
> counters to bytes would introduce an additional overhead.
> So instead let's store global and per-node counters
> in pages, and memcg and lruvec counters in bytes.
>
> To make the API clean all access helpers (both on the read
> and write sides) are dealing with bytes.
>

Is the "dealing with bytes" only for slab stats or for all vmstat stats?

> To avoid back-and-forth conversions a new flavor of read-side
> helpers is introduced, which always returns values in pages:
> node_page_state_pages() and global_node_page_state_pages().
>
> Actually new helpers are just reading raw values. Old helpers are
> simple wrappers, which will complain on an attempt to read
> byte value, because at the moment no one actually needs bytes.
>
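
A minimal userspace sketch of the scheme described in the quoted text above,
under stated assumptions: 4 KiB pages, and sketch_* names that are invented
for illustration and are not the kernel's. Storage is kept in pages, the
write helper takes the item's API unit, the "_pages" reader returns the raw
value, and the old-style reader refuses byte-sized items.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Hypothetical items; only the slab one is exposed in bytes. */
enum sketch_item { SKETCH_FILE_PAGES, SKETCH_SLAB_B, SKETCH_NR_ITEMS };

/* Node-level storage is always in pages, whatever the API unit is. */
static long sketch_node_stat[SKETCH_NR_ITEMS];

/* Says which unit the API uses for an item, not how it is stored. */
static bool sketch_item_in_bytes(enum sketch_item item)
{
	return item == SKETCH_SLAB_B;
}

/* Write side: callers pass the API unit; byte deltas become pages. */
static void sketch_mod_node_state(enum sketch_item item, long delta)
{
	if (sketch_item_in_bytes(item)) {
		/* Node-level updates are expected to be whole pages. */
		assert(delta % SKETCH_PAGE_SIZE == 0);
		delta /= SKETCH_PAGE_SIZE;
	}
	sketch_node_stat[item] += delta;
}

/* Read side, "pages" flavor: returns the raw stored value. */
static long sketch_node_state_pages(enum sketch_item item)
{
	return sketch_node_stat[item];
}

/* Read side, old flavor: a wrapper that complains about byte items. */
static long sketch_node_state(enum sketch_item item)
{
	assert(!sketch_item_in_bytes(item));
	return sketch_node_state_pages(item);
}
```

In this sketch a caller that adds 8192 bytes to SKETCH_SLAB_B sees 2 from
sketch_node_state_pages(), while page-sized items pass through unchanged.
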
> Thanks to Johannes Weiner for the idea of having the byte-sized API
> on top of the page-sized internal storage.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  drivers/base/node.c    |  2 +-
>  include/linux/mmzone.h | 10 ++++++++++
>  include/linux/vmstat.h | 14 +++++++++++++-
>  mm/memcontrol.c        | 14 ++++++++++----
>  mm/vmstat.c            | 30 ++++++++++++++++++++++++++----
>  5 files changed, 60 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 5b02f69769e8..e21e31359297 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -513,7 +513,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>
>         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>                 n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
> -                            node_page_state(pgdat, i));
> +                            node_page_state_pages(pgdat, i));
>
>         return n;
>  }
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c4c37fd12104..fa8eb49d9898 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -206,6 +206,16 @@ enum node_stat_item {
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> +/*
> + * Returns true if the value is measured in bytes (most vmstat values are
> + * measured in pages). This defines the API part, the internal representation
> + * might be different.
> + */
> +static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> +{
> +       return false;
> +}
> +
>  /*
>   * We do arithmetic on the LRU lists in various places in the code,
>   * so it is important to keep the active lists LRU_ACTIVE higher in
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index aa961088c551..91220ace31da 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -8,6 +8,7 @@
>  #include <linux/vm_event_item.h>
>  #include <linux/atomic.h>
>  #include <linux/static_key.h>
> +#include <linux/mmdebug.h>
>
>  extern int sysctl_stat_interval;
>
> @@ -192,7 +193,8 @@ static inline unsigned long global_zone_page_state(enum zone_stat_item item)
>         return x;
>  }
>
> -static inline unsigned long global_node_page_state(enum node_stat_item item)
> +static inline
> +unsigned long global_node_page_state_pages(enum node_stat_item item)
>  {
>         long x = atomic_long_read(&vm_node_stat[item]);
>  #ifdef CONFIG_SMP
> @@ -202,6 +204,13 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
>         return x;
>  }
>
> +static inline unsigned long global_node_page_state(enum node_stat_item item)
> +{
> +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> +
> +       return global_node_page_state_pages(item);
> +}
> +
>  static inline unsigned long zone_page_state(struct zone *zone,
>                                         enum zone_stat_item item)
>  {
> @@ -242,9 +251,12 @@ extern unsigned long sum_zone_node_page_state(int node,
>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>                                                 enum node_stat_item item);
> +extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
> +                                          enum node_stat_item item);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
> +#define node_page_state_pages(node, item) global_node_page_state_pages(item)
>  #endif /* CONFIG_NUMA */
>
>  #ifdef CONFIG_SMP
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8a91e98556b..07d02e61a73e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -681,13 +681,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -       long x;
> +       long x, threshold = MEMCG_CHARGE_BATCH;
>
>         if (mem_cgroup_disabled())
>                 return;
>
> +       if (vmstat_item_in_bytes(idx))
> +               threshold <<= PAGE_SHIFT;
> +

From the above am I understanding correctly that even after moving to
byte-level accounting, we can still see stats with potential error
limited by (BATCH-1)*PAGE_SIZE*nr_cpus?

>         x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> -       if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> +       if (unlikely(abs(x) > threshold)) {
>                 struct mem_cgroup *mi;
>
>                 /*
> @@ -718,7 +721,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  {
>         struct mem_cgroup_per_node *pn;
>         struct mem_cgroup *memcg;
> -       long x;
> +       long x, threshold = MEMCG_CHARGE_BATCH;
>
>         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>         memcg = pn->memcg;
> @@ -729,8 +732,11 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>         /* Update lruvec */
>         __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
>
> +       if (vmstat_item_in_bytes(idx))
> +               threshold <<= PAGE_SHIFT;
> +
>         x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> -       if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> +       if (unlikely(abs(x) > threshold)) {
>                 pg_data_t *pgdat = lruvec_pgdat(lruvec);
>                 struct mem_cgroup_per_node *pi;
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 80c9b6221535..f1c321e1d6d3 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -341,6 +341,11 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
>         long x;
>         long t;
>
> +       if (vmstat_item_in_bytes(item)) {
> +               VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
> +               delta >>= PAGE_SHIFT;
> +       }
> +
>         x = delta + __this_cpu_read(*p);
>
>         t = __this_cpu_read(pcp->stat_threshold);
> @@ -398,6 +403,8 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>         s8 __percpu *p = pcp->vm_node_stat_diff + item;
>         s8 v, t;
>
> +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> +
>         v = __this_cpu_inc_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v > t)) {
> @@ -442,6 +449,8 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>         s8 __percpu *p = pcp->vm_node_stat_diff + item;
>         s8 v, t;
>
> +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> +
>         v = __this_cpu_dec_return(*p);
>         t = __this_cpu_read(pcp->stat_threshold);
>         if (unlikely(v < - t)) {
> @@ -541,6 +550,11 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>         s8 __percpu *p = pcp->vm_node_stat_diff + item;
>         long o, n, t, z;
>
> +       if (vmstat_item_in_bytes(item)) {
> +               VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
> +               delta >>= PAGE_SHIFT;
> +       }
> +
>         do {
>                 z = 0;  /* overflow to node counters */
>
> @@ -989,8 +1003,8 @@ unsigned long sum_zone_numa_state(int node,
>  /*
>   * Determine the per node value of a stat item.
>   */
> -unsigned long node_page_state(struct pglist_data *pgdat,
> -                               enum node_stat_item item)
> +unsigned long node_page_state_pages(struct pglist_data *pgdat,
> +                                   enum node_stat_item item)
>  {
>         long x = atomic_long_read(&pgdat->vm_stat[item]);
>  #ifdef CONFIG_SMP
> @@ -999,6 +1013,14 @@ unsigned long node_page_state(struct pglist_data *pgdat,
>  #endif
>         return x;
>  }
> +
> +unsigned long node_page_state(struct pglist_data *pgdat,
> +                             enum node_stat_item item)
> +{
> +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> +
> +       return node_page_state_pages(pgdat, item);
> +}

So, for non-slab, node_page_state and node_page_state_pages will be
the same but different for slab vmstats. However we should not be
calling node_page_state with slab vmstats because we don't need it,
right?

>  #endif
>
>  #ifdef CONFIG_COMPACTION
> @@ -1581,7 +1603,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                 seq_printf(m, "\n  per-node stats");
>                 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>                         seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
> -                                  node_page_state(pgdat, i));
> +                                  node_page_state_pages(pgdat, i));
>                 }
>         }
>         seq_printf(m,
> @@ -1702,7 +1724,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  #endif
>
>         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> -               v[i] = global_node_page_state(i);
> +               v[i] = global_node_page_state_pages(i);
>         v += NR_VM_NODE_STAT_ITEMS;
>
>         global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
> --
> 2.25.4
>

* Re: [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-06-17  2:50     ` Roman Gushchin
@ 2020-06-17  2:59       ` Shakeel Butt
  2020-06-17  3:19         ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  2:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 7:50 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jun 16, 2020 at 06:52:09PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > To convert memcg and lruvec slab counters to bytes there must be
> > > a way to change these counters without touching node counters.
> > > Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/memcontrol.h | 17 +++++++++++++++
> > >  mm/memcontrol.c            | 43 +++++++++++++++++++++-----------------
> > >  2 files changed, 41 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index bbf624a7f5a6..93dbc7f9d8b8 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> > >         return x;
> > >  }
> > >
> > > +void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > > +                             int val);
> > >  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > >                         int val);
> > >  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
> > >  void mod_memcg_obj_state(void *p, int idx, int val);
> > >
> > > +static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> > > +                                         enum node_stat_item idx, int val)
> >
> > Is this function used in later patches? Any benefit introducing it
> > here instead of in the patch where it is used for the first time?
>
> Yes, it's used in "mm: memcg/slab: charge individual slab objects instead of pages".
>
> It's a fairly large patchset with many internal dependencies, so there is
> always a trade-off between putting everything into a single patch, which is
> hard to review, and splitting out some changes, which do not make much sense
> without seeing the whole picture.
>
> In this particular case splitting out a formal and easy-to-verify change makes
> the actual non-trivial patch smaller and hopefully easier to review.
>
> But of course it's all subjective.
>
> Thanks!

I am fine with that.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

* Re: [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes
  2020-06-08 23:06 ` [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
@ 2020-06-17  3:03   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  3:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> In order to prepare for per-object slab memory accounting, convert
> NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
>
> To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
> NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
>
> Internally global and per-node counters are stored in pages,
> however memcg and lruvec counters are stored in bytes.
> This scheme may look weird, but only for now. As soon as slab
> pages are shared between multiple cgroups, global and
> node counters will reflect the total number of slab pages.
> However memcg and lruvec counters will be used for per-memcg
> slab memory tracking, which will take individual kernel objects
> into account. Keeping global and node counters in pages helps
> to avoid additional overhead.
>
> The size of slab memory shouldn't exceed 4GB on 32-bit machines,
> so it will fit into the atomic_long_t we use for vmstats.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17  2:41   ` Roman Gushchin
@ 2020-06-17  3:05     ` Shakeel Butt
  2020-06-17  3:32       ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  3:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
[...]
> >
> > Have you performed any [perf] testing on SLAB with this patchset?
>
> The accounting part is the same for SLAB and SLUB, so there should be no
> significant difference. I've checked that it compiles, boots and passes
> kselftests. And that memory savings are there.
>

What about performance? Also you mentioned that sharing kmem-cache
between accounted and non-accounted can have additional overhead. Any
difference between SLAB and SLUB for such a case?

* Re: [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index()
  2020-06-08 23:06 ` [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-06-17  3:08   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17  3:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> This commit implements SLUB version of the obj_to_index() function,
> which will be required to calculate the offset of obj_cgroup in the
> obj_cgroups vector to store/obtain the objcg ownership data.
>
> To make it faster, let's repeat the SLAB's trick introduced by
> commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> divide in obj_to_index()") and avoid an expensive division.
>
> Vlastimil Babka noticed that SLUB already has a similar
> function called slab_index(), which is defined only if SLUB_DEBUG
> is enabled. The function does similar math, but with a division,
> and it also takes a page address instead of a page pointer.
>
> Let's remove slab_index() and replace it with the new helper
> __obj_to_index(), which takes a page address. obj_to_index()
> will be a simple wrapper taking a page pointer and passing
> page_address(page) into __obj_to_index().
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
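
For readers following along, the "multiply instead of a divide" trick
mentioned in the commit message can be sketched in user space as below.
This is only an illustration modeled on the well-known reciprocal-division
scheme, not the code from the patch: struct recip, recip_value(),
recip_divide() and obj_to_index_model() are hypothetical names, and the
sketch assumes 32-bit offsets and a non-zero object size.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed "magic" multiplier and shifts for one divisor. */
struct recip {
	uint32_t m;
	uint8_t sh1, sh2;
};

/* Position of the highest set bit, 1-based; 0 for x == 0. */
static int fls32(uint32_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Computed once per cache, when the object size d (d >= 1) is known. */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	int l = fls32(d - 1);
	uint64_t m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;

	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);
	return r;
}

/* a / d using one multiplication and a few shifts, no division. */
static uint32_t recip_divide(uint32_t a, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Index of the object containing obj within a slab starting at slab_base. */
static unsigned int obj_to_index_model(const char *slab_base, const void *obj,
				       struct recip size_recip)
{
	uint32_t offset = (uint32_t)((const char *)obj - slab_base);

	return recip_divide(offset, size_recip);
}
```

The point of the design is that the division cost is paid once, when the
reciprocal is computed for a given object size, while every later
object-to-index lookup only needs a multiply and shifts.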

* Re: [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-06-17  2:57   ` Shakeel Butt
@ 2020-06-17  3:19     ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-17  3:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 07:57:54PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > To implement per-object slab memory accounting, we need to
> > convert slab vmstat counters to bytes. Actually, out of
> > 4 levels of counters: global, per-node, per-memcg and per-lruvec
> > only two last levels will require byte-sized counters.
> > It's because global and per-node counters will be counting the
> > number of slab pages, and per-memcg and per-lruvec will be
> > counting the amount of memory taken by charged slab objects.
> >
> > Converting all vmstat counters to bytes or even all slab
> > counters to bytes would introduce an additional overhead.
> > So instead let's store global and per-node counters
> > in pages, and memcg and lruvec counters in bytes.
> >
> > To make the API clean all access helpers (both on the read
> > and write sides) are dealing with bytes.
> >
> 
> The "dealing with bytes" is only for slab stats or all vmstat stats?

Only slab stats for now.
I've sent a percpu memory accounting patchset separately, which
will add another byte-sized counter.

> 
> > To avoid back-and-forth conversions a new flavor of read-side
> > helpers is introduced, which always returns values in pages:
> > node_page_state_pages() and global_node_page_state_pages().
> >
> > Actually new helpers are just reading raw values. Old helpers are
> > simple wrappers, which will complain on an attempt to read
> > byte value, because at the moment no one actually needs bytes.
> >
> > Thanks to Johannes Weiner for the idea of having the byte-sized API
> > on top of the page-sized internal storage.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  drivers/base/node.c    |  2 +-
> >  include/linux/mmzone.h | 10 ++++++++++
> >  include/linux/vmstat.h | 14 +++++++++++++-
> >  mm/memcontrol.c        | 14 ++++++++++----
> >  mm/vmstat.c            | 30 ++++++++++++++++++++++++++----
> >  5 files changed, 60 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 5b02f69769e8..e21e31359297 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -513,7 +513,7 @@ static ssize_t node_read_vmstat(struct device *dev,
> >
> >         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> >                 n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
> > -                            node_page_state(pgdat, i));
> > +                            node_page_state_pages(pgdat, i));
> >
> >         return n;
> >  }
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c4c37fd12104..fa8eb49d9898 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -206,6 +206,16 @@ enum node_stat_item {
> >         NR_VM_NODE_STAT_ITEMS
> >  };
> >
> > +/*
> > + * Returns true if the value is measured in bytes (most vmstat values are
> > + * measured in pages). This defines the API part, the internal representation
> > + * might be different.
> > + */
> > +static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> > +{
> > +       return false;
> > +}
> > +
> >  /*
> >   * We do arithmetic on the LRU lists in various places in the code,
> >   * so it is important to keep the active lists LRU_ACTIVE higher in
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index aa961088c551..91220ace31da 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -8,6 +8,7 @@
> >  #include <linux/vm_event_item.h>
> >  #include <linux/atomic.h>
> >  #include <linux/static_key.h>
> > +#include <linux/mmdebug.h>
> >
> >  extern int sysctl_stat_interval;
> >
> > @@ -192,7 +193,8 @@ static inline unsigned long global_zone_page_state(enum zone_stat_item item)
> >         return x;
> >  }
> >
> > -static inline unsigned long global_node_page_state(enum node_stat_item item)
> > +static inline
> > +unsigned long global_node_page_state_pages(enum node_stat_item item)
> >  {
> >         long x = atomic_long_read(&vm_node_stat[item]);
> >  #ifdef CONFIG_SMP
> > @@ -202,6 +204,13 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
> >         return x;
> >  }
> >
> > +static inline unsigned long global_node_page_state(enum node_stat_item item)
> > +{
> > +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> > +
> > +       return global_node_page_state_pages(item);
> > +}
> > +
> >  static inline unsigned long zone_page_state(struct zone *zone,
> >                                         enum zone_stat_item item)
> >  {
> > @@ -242,9 +251,12 @@ extern unsigned long sum_zone_node_page_state(int node,
> >  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
> >  extern unsigned long node_page_state(struct pglist_data *pgdat,
> >                                                 enum node_stat_item item);
> > +extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
> > +                                          enum node_stat_item item);
> >  #else
> >  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
> >  #define node_page_state(node, item) global_node_page_state(item)
> > +#define node_page_state_pages(node, item) global_node_page_state_pages(item)
> >  #endif /* CONFIG_NUMA */
> >
> >  #ifdef CONFIG_SMP
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e8a91e98556b..07d02e61a73e 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -681,13 +681,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >   */
> >  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
> >  {
> > -       long x;
> > +       long x, threshold = MEMCG_CHARGE_BATCH;
> >
> >         if (mem_cgroup_disabled())
> >                 return;
> >
> > +       if (vmstat_item_in_bytes(idx))
> > +               threshold <<= PAGE_SHIFT;
> > +
> 
> From the above am I understanding correctly that even after moving to
> byte-level accounting, we can still see stats with potential error
> limited by (BATCH-1)*PAGE_SIZE*nr_cpus?
> 
> >         x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
> > -       if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> > +       if (unlikely(abs(x) > threshold)) {
> >                 struct mem_cgroup *mi;
> >
> >                 /*
> > @@ -718,7 +721,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  {
> >         struct mem_cgroup_per_node *pn;
> >         struct mem_cgroup *memcg;
> > -       long x;
> > +       long x, threshold = MEMCG_CHARGE_BATCH;
> >
> >         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> >         memcg = pn->memcg;
> > @@ -729,8 +732,11 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >         /* Update lruvec */
> >         __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
> >
> > +       if (vmstat_item_in_bytes(idx))
> > +               threshold <<= PAGE_SHIFT;
> > +
> >         x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
> > -       if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
> > +       if (unlikely(abs(x) > threshold)) {
> >                 pg_data_t *pgdat = lruvec_pgdat(lruvec);
> >                 struct mem_cgroup_per_node *pi;
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 80c9b6221535..f1c321e1d6d3 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -341,6 +341,11 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> >         long x;
> >         long t;
> >
> > +       if (vmstat_item_in_bytes(item)) {
> > +               VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
> > +               delta >>= PAGE_SHIFT;
> > +       }
> > +
> >         x = delta + __this_cpu_read(*p);
> >
> >         t = __this_cpu_read(pcp->stat_threshold);
> > @@ -398,6 +403,8 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         s8 v, t;
> >
> > +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> > +
> >         v = __this_cpu_inc_return(*p);
> >         t = __this_cpu_read(pcp->stat_threshold);
> >         if (unlikely(v > t)) {
> > @@ -442,6 +449,8 @@ void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         s8 v, t;
> >
> > +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> > +
> >         v = __this_cpu_dec_return(*p);
> >         t = __this_cpu_read(pcp->stat_threshold);
> >         if (unlikely(v < - t)) {
> > @@ -541,6 +550,11 @@ static inline void mod_node_state(struct pglist_data *pgdat,
> >         s8 __percpu *p = pcp->vm_node_stat_diff + item;
> >         long o, n, t, z;
> >
> > +       if (vmstat_item_in_bytes(item)) {
> > +               VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
> > +               delta >>= PAGE_SHIFT;
> > +       }
> > +
> >         do {
> >                 z = 0;  /* overflow to node counters */
> >
> > @@ -989,8 +1003,8 @@ unsigned long sum_zone_numa_state(int node,
> >  /*
> >   * Determine the per node value of a stat item.
> >   */
> > -unsigned long node_page_state(struct pglist_data *pgdat,
> > -                               enum node_stat_item item)
> > +unsigned long node_page_state_pages(struct pglist_data *pgdat,
> > +                                   enum node_stat_item item)
> >  {
> >         long x = atomic_long_read(&pgdat->vm_stat[item]);
> >  #ifdef CONFIG_SMP
> > @@ -999,6 +1013,14 @@ unsigned long node_page_state(struct pglist_data *pgdat,
> >  #endif
> >         return x;
> >  }
> > +
> > +unsigned long node_page_state(struct pglist_data *pgdat,
> > +                             enum node_stat_item item)
> > +{
> > +       VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
> > +
> > +       return node_page_state_pages(pgdat, item);
> > +}
> 
> So, for non-slab, node_page_state and node_page_state_pages will be
> the same but different for slab vmstats. However we should not be
> calling node_page_state with slab vmstats because we don't need it,
> right?

Right.
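
To make the scheme discussed in this message easier to reason about, here
is a rough user-space model of it: a byte-sized API on top of page-sized
node storage, plus a per-CPU batching threshold that is scaled by the page
size for byte-sized items. It is a sketch under stated assumptions, not the
kernel code: all MODEL_* names, item_in_bytes(), mod_node(), mod_memcg()
and max_lag() are hypothetical, and a batch of 32 with 4 KiB pages and
4 CPUs is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_PAGE_SHIFT 12			/* assumed: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1L << MODEL_PAGE_SHIFT)
#define MODEL_BATCH      32L			/* assumed stand-in for MEMCG_CHARGE_BATCH */
#define MODEL_NR_CPUS    4			/* assumed CPU count */

enum model_item { MODEL_FILE_PAGES, MODEL_SLAB_B, MODEL_NR_ITEMS };

/* Stand-in for vmstat_item_in_bytes(): only the slab item uses bytes here. */
static bool item_in_bytes(enum model_item item)
{
	return item == MODEL_SLAB_B;
}

struct node_model {
	long stat[MODEL_NR_ITEMS];		/* always stored in pages */
};

struct memcg_model {
	long stat[MODEL_NR_ITEMS];		/* shared counter, bytes for byte items */
	long percpu[MODEL_NR_CPUS][MODEL_NR_ITEMS];
};

/* Node side: callers pass bytes for byte items, storage stays in pages. */
static void mod_node(struct node_model *node, enum model_item item, long delta)
{
	if (item_in_bytes(item)) {
		/* The patch shifts by PAGE_SHIFT; division is used here for clarity. */
		assert(delta % MODEL_PAGE_SIZE == 0);
		delta /= MODEL_PAGE_SIZE;
	}
	node->stat[item] += delta;
}

/* Per-CPU flush threshold: a batch of pages, expressed in the item's unit. */
static long flush_threshold(enum model_item item)
{
	long threshold = MODEL_BATCH;

	if (item_in_bytes(item))
		threshold <<= MODEL_PAGE_SHIFT;
	return threshold;
}

/* Memcg side: accumulate per CPU, flush to the shared counter past the threshold. */
static void mod_memcg(struct memcg_model *memcg, int cpu,
		      enum model_item item, long val)
{
	long x = val + memcg->percpu[cpu][item];

	if (labs(x) > flush_threshold(item)) {
		memcg->stat[item] += x;
		x = 0;
	}
	memcg->percpu[cpu][item] = x;
}

/* Worst-case amount still sitting in per-CPU deltas in this model. */
static long max_lag(enum model_item item)
{
	return MODEL_NR_CPUS * flush_threshold(item);
}
```

In this model the flush test is "abs(x) > threshold", so each CPU can hold
up to MODEL_BATCH * MODEL_PAGE_SIZE bytes that are not yet visible in the
shared counter. With the assumed values that is 128 KiB per CPU, or 512 KiB
across the 4 modeled CPUs.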

* Re: [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  2020-06-17  2:59       ` Shakeel Butt
@ 2020-06-17  3:19         ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-17  3:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 07:59:40PM -0700, Shakeel Butt wrote:
> On Tue, Jun 16, 2020 at 7:50 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Tue, Jun 16, 2020 at 06:52:09PM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > To convert memcg and lruvec slab counters to bytes there must be
> > > > a way to change these counters without touching node counters.
> > > > Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state().
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > ---
> > > >  include/linux/memcontrol.h | 17 +++++++++++++++
> > > >  mm/memcontrol.c            | 43 +++++++++++++++++++++-----------------
> > > >  2 files changed, 41 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index bbf624a7f5a6..93dbc7f9d8b8 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> > > >         return x;
> > > >  }
> > > >
> > > > +void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > > > +                             int val);
> > > >  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > > >                         int val);
> > > >  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
> > > >  void mod_memcg_obj_state(void *p, int idx, int val);
> > > >
> > > > +static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> > > > +                                         enum node_stat_item idx, int val)
> > >
> > > Is this function used in later patches? Any benefit introducing it
> > > here instead of in the patch where it is used for the first time?
> >
> > Yes, it's used in "mm: memcg/slab: charge individual slab objects instead of pages".
> >
> > > It's a fairly large patchset with many internal dependencies, so there is
> > > always a trade-off between putting everything into a single patch, which is
> > > hard to review, and splitting out some changes, which do not make much sense
> > > without seeing the whole picture.
> > >
> > > In this particular case splitting out a formal and easy-to-verify change makes
> > > the actual non-trivial patch smaller and hopefully easier to review.
> >
> > But of course it's all subjective.
> >
> > Thanks!
> 
> I am fine with that.
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thank you! Appreciate it!

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17  3:05     ` Shakeel Butt
@ 2020-06-17  3:32       ` Roman Gushchin
  2020-06-17 11:24         ` Vlastimil Babka
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-17  3:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:
> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> [...]
> > >
> > > Have you performed any [perf] testing on SLAB with this patchset?
> >
> > The accounting part is the same for SLAB and SLUB, so there should be no
> > significant difference. I've checked that it compiles, boots and passes
> > kselftests. And that memory savings are there.
> >
> 
> What about performance? Also you mentioned that sharing kmem-cache
> between accounted and non-accounted can have additional overhead. Any
> difference between SLAB and SLUB for such a case?

Not really.

Sharing a single set of caches adds some overhead to root- and non-accounted
allocations, which is something I tried hard to avoid in my original version.
But I have to admit, it allows us to simplify and remove a lot of code, and here
it's hard to argue with Johannes, who pushed for this design.

Performance testing is not that easy here, because it's not obvious what
we want to test. Obviously, per-object accounting is more expensive, and
measuring something like 1000000 allocations and deallocations in a row from
a single kmem_cache will show a regression. But in the real world the relative
cost of allocations is usually low, and we can get some benefits from a smaller
working set and from having shared kmem_cache objects cache hot.
Not to mention the memory savings and the reduced fragmentation.

We've done extensive testing of the original version in Facebook production,
and we haven't noticed any regressions so far. But I have to admit, we were
using the original version with two sets of kmem_caches.

If you have any specific tests in mind, I can definitely run them. Or if you
can help with the performance evaluation, I'll appreciate it a lot.

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17  3:32       ` Roman Gushchin
@ 2020-06-17 11:24         ` Vlastimil Babka
  2020-06-17 14:31           ` Mel Gorman
  2020-06-18  1:29           ` Roman Gushchin
  0 siblings, 2 replies; 92+ messages in thread
From: Vlastimil Babka @ 2020-06-17 11:24 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Kernel Team, LKML, Mel Gorman, Jesper Dangaard Brouer

On 6/17/20 5:32 AM, Roman Gushchin wrote:
> On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:
>> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:
>> >
>> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
>> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>> > > >
>> [...]
>> > >
>> > > Have you performed any [perf] testing on SLAB with this patchset?
>> >
>> > The accounting part is the same for SLAB and SLUB, so there should be no
>> > significant difference. I've checked that it compiles, boots and passes
>> > kselftests. And that memory savings are there.
>> >
>> 
>> What about performance? Also you mentioned that sharing kmem-cache
>> between accounted and non-accounted can have additional overhead. Any
>> difference between SLAB and SLUB for such a case?
> 
> Not really.
> 
> Sharing a single set of caches adds some overhead to root- and non-accounted
> allocations, which is something I've tried hard to avoid in my original version.
> But I have to admit, it lets us simplify and remove a lot of code, and here
> it's hard to argue with Johannes, who pushed for this design.
> 
> Performance testing is not that easy, because it's not obvious what
> we want to test. Obviously, per-object accounting is more expensive, and
> measuring something like 1000000 allocations and deallocations in a row from
> a single kmem_cache will show a regression. But in the real world the relative
> cost of allocations is usually low, and we can get some benefits from a smaller
> working set and from keeping shared kmem_cache objects cache hot.
> Not to mention some extra memory savings and the reduced fragmentation.
> 
> We've done extensive testing of the original version in Facebook production,
> and we haven't noticed any regressions so far. But I have to admit, we were
> using the original version with two sets of kmem_caches.
> 
> If you have any specific tests in mind, I can definitely run them. Or if you
> can help with the performance evaluation, I'll appreciate it a lot.

Jesper provided some pointers here [1]; it would be really great if you could
run at least those microbenchmarks. With mmtests, the major question is
which subset/profiles to run. Maybe the referenced commits provide some hints,
or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.

[1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/

> Thanks!
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17 11:24         ` Vlastimil Babka
@ 2020-06-17 14:31           ` Mel Gorman
  2020-06-20  0:57             ` Roman Gushchin
  2020-06-18  1:29           ` Roman Gushchin
  1 sibling, 1 reply; 92+ messages in thread
From: Mel Gorman @ 2020-06-17 14:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Roman Gushchin, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Jesper Dangaard Brouer

On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> > Not really.
> > 
> > Sharing a single set of caches adds some overhead to root- and non-accounted
> > allocations, which is something I've tried hard to avoid in my original version.
> > But I have to admit, it lets us simplify and remove a lot of code, and here
> > it's hard to argue with Johannes, who pushed for this design.
> > 
> > Performance testing is not that easy, because it's not obvious what
> > we want to test. Obviously, per-object accounting is more expensive, and
> > measuring something like 1000000 allocations and deallocations in a row from
> > a single kmem_cache will show a regression. But in the real world the relative
> > cost of allocations is usually low, and we can get some benefits from a smaller
> > working set and from keeping shared kmem_cache objects cache hot.
> > Not to mention some extra memory savings and the reduced fragmentation.
> > 
> > We've done extensive testing of the original version in Facebook production,
> > and we haven't noticed any regressions so far. But I have to admit, we were
> > using the original version with two sets of kmem_caches.
> > 
> > If you have any specific tests in mind, I can definitely run them. Or if you
> > can help with the performance evaluation, I'll appreciate it a lot.
> 
> Jesper provided some pointers here [1]; it would be really great if you could
> run at least those microbenchmarks. With mmtests, the major question is
> which subset/profiles to run. Maybe the referenced commits provide some hints,
> or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> 

Last time, the mmtests configurations I used for a basic
comparison were

db-pgbench-timed-ro-small-ext4
db-pgbench-timed-ro-small-xfs
io-dbench4-async-ext4
io-dbench4-async-xfs
io-bonnie-dir-async-ext4
io-bonnie-dir-async-xfs
io-bonnie-file-async-ext4
io-bonnie-file-async-xfs
io-fsmark-xfsrepair-xfs
io-metadata-xfs
network-netperf-unbound
network-netperf-cross-node
network-netperf-cross-socket
network-sockperf-unbound
network-netperf-unix-unbound
network-netpipe
network-tbench
pagereclaim-shrinker-ext4
scheduler-unbound
scheduler-forkintensive
workload-kerndevel-xfs
workload-thpscale-madvhugepage-xfs
workload-thpscale-xfs

Some were more valid than others in terms of doing an evaluation. I
followed up later with a more comprehensive comparison but that was
overkill.

Each time I did a slab/slub comparison in the past, I had to reverify
the rate at which kmem_cache_* functions were actually being called, as the
pattern can change over time even for the same workload.  A comparison
gets more complicated when comparing cgroups, as ideally there would be
workloads running in multiple groups, but that gets complex and I think
it's reasonable to just test the "basic" case without cgroups.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items
  2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
  2020-06-17  2:57   ` Shakeel Butt
@ 2020-06-17 15:55   ` Shakeel Butt
  1 sibling, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-17 15:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> To implement per-object slab memory accounting, we need to
> convert slab vmstat counters to bytes. Actually, out of the
> 4 levels of counters (global, per-node, per-memcg and per-lruvec),
> only the last two levels will require byte-sized counters.
> It's because global and per-node counters will be counting the
> number of slab pages, and per-memcg and per-lruvec will be
> counting the amount of memory taken by charged slab objects.
>
> Converting all vmstat counters to bytes or even all slab
> counters to bytes would introduce an additional overhead.
> So instead let's store global and per-node counters
> in pages, and memcg and lruvec counters in bytes.
>
> To keep the API clean, all access helpers (both on the read
> and write sides) deal with bytes.
>
> To avoid back-and-forth conversions a new flavor of read-side
> helpers is introduced, which always returns values in pages:
> node_page_state_pages() and global_node_page_state_pages().
>
> Actually new helpers are just reading raw values. Old helpers are
> simple wrappers, which will complain on an attempt to read
> byte value, because at the moment no one actually needs bytes.
>
> Thanks to Johannes Weiner for the idea of having the byte-sized API
> on top of the page-sized internal storage.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
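
As a rough userspace illustration of the scheme in the changelog above (a
byte-sized API on top of page-sized storage), a minimal sketch might look
like the following. It is not the kernel implementation: the toy_* names,
the two-item enum and the 4 KiB page size are all assumptions made for the
example.

```c
/* Userspace model only; toy_* names and the page size are hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1L << TOY_PAGE_SHIFT)

enum toy_stat_item {
	TOY_NR_FILE_PAGES,          /* ordinary item, API in pages */
	TOY_NR_SLAB_RECLAIMABLE_B,  /* slab item, API in bytes */
	TOY_NR_ITEMS,
};

/* Node-level storage is always kept in pages. */
static long toy_node_stat[TOY_NR_ITEMS];

static bool toy_item_in_bytes(enum toy_stat_item item)
{
	return item == TOY_NR_SLAB_RECLAIMABLE_B;
}

/* Write side: bytes for byte-sized items, pages for everything else. */
static void toy_mod_node_state(enum toy_stat_item item, long delta)
{
	if (toy_item_in_bytes(item)) {
		/* Node-level slab updates are expected to be whole pages. */
		assert(delta % TOY_PAGE_SIZE == 0);
		delta /= TOY_PAGE_SIZE;
	}
	toy_node_stat[item] += delta;
}

/* New-style reader: returns the raw stored value, always in pages. */
static long toy_node_state_pages(enum toy_stat_item item)
{
	return toy_node_stat[item];
}

/* Old-style reader: a thin wrapper that complains about byte-sized items. */
static long toy_node_state(enum toy_stat_item item)
{
	assert(!toy_item_in_bytes(item));
	return toy_node_state_pages(item);
}
```

With this layout, a caller that adds 2 * TOY_PAGE_SIZE bytes to the slab item
reads back 2 from toy_node_state_pages(), so no back-and-forth conversion is
needed on the read side.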

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
@ 2020-06-17 23:35   ` Andrew Morton
  2020-06-18  0:35     ` Roman Gushchin
  2020-06-22 19:21   ` Shakeel Butt
  1 sibling, 1 reply; 92+ messages in thread
From: Andrew Morton @ 2020-06-17 23:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Christoph Lameter, Johannes Weiner, Michal Hocko, Shakeel Butt,
	linux-mm, Vlastimil Babka, kernel-team, linux-kernel

On Mon, 8 Jun 2020 16:06:52 -0700 Roman Gushchin <guro@fb.com> wrote:

> Instead of having two sets of kmem_caches: one for system-wide and
> non-accounted allocations and the second one shared by all accounted
> allocations, we can use just one.
> 
> The idea is simple: space for obj_cgroup metadata can be allocated
> on demand and filled only for accounted allocations.
> 
> It allows us to remove a bunch of code which is required to handle
> kmem_cache clones for accounted allocations. There is no more need
> to create them, accumulate statistics, propagate attributes, etc.
> It's quite a significant simplification.
> 
> Also, because the total number of slab_caches is almost halved
> (not all kmem_caches have a memcg clone), some additional memory
> savings are expected. On my devvm it additionally saves about 3.5%
> of slab memory.
> 

This ran afoul of Vlastimil's "mm, slab/slub: move and improve
cache_from_obj()"
(http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz).  I
resolved things as below.  Not too sure about slab.c's
cache_from_obj()...
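
For readers following along, the "allocate obj_cgroup metadata on demand"
idea from the changelog can be modelled in userspace roughly as below. This
is only a sketch under simplifying assumptions: the toy_* types are
hypothetical, charging is folded into the post-alloc step (the patch charges
in the pre-alloc hook and uncharges on failure), and there is no locking.

```c
/* Userspace model; toy_* names are hypothetical, not kernel API. */
#include <stdlib.h>

#define TOY_OBJS_PER_PAGE 8

struct toy_objcg {
	long charged;   /* bytes currently charged to this group */
	int refs;       /* references held by live objects */
};

struct toy_page {
	/* One owner slot per object; NULL until the first accounted allocation. */
	struct toy_objcg **obj_cgroups;
};

/*
 * Record the owner of object idx. The per-page vector is allocated lazily,
 * so pages holding only non-accounted objects never pay for it.
 * Returns 0 on success, -1 on a bad index or failed vector allocation.
 */
static int toy_post_alloc(struct toy_page *page, unsigned int idx,
			  struct toy_objcg *objcg, long size)
{
	if (!objcg)
		return 0;   /* non-accounted allocation: nothing to do */
	if (idx >= TOY_OBJS_PER_PAGE)
		return -1;

	if (!page->obj_cgroups) {
		page->obj_cgroups = calloc(TOY_OBJS_PER_PAGE,
					   sizeof(*page->obj_cgroups));
		if (!page->obj_cgroups)
			return -1;
	}

	objcg->refs++;
	objcg->charged += size;
	page->obj_cgroups[idx] = objcg;
	return 0;
}

/* Free path: tolerate pages without a vector and slots without an owner. */
static void toy_free_hook(struct toy_page *page, unsigned int idx, long size)
{
	struct toy_objcg *objcg;

	if (!page->obj_cgroups || idx >= TOY_OBJS_PER_PAGE)
		return;

	objcg = page->obj_cgroups[idx];
	page->obj_cgroups[idx] = NULL;
	if (!objcg)
		return;

	objcg->charged -= size;
	objcg->refs--;
}
```

The two NULL checks in toy_free_hook() mirror the reason the free hook in the
patch gains extra checks: once accounted and non-accounted objects share a
page, neither the vector nor an individual slot is guaranteed to be populated.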


From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: use a single set of kmem_caches for all allocations

Instead of having two sets of kmem_caches: one for system-wide and
non-accounted allocations and the second one shared by all accounted
allocations, we can use just one.

The idea is simple: space for obj_cgroup metadata can be allocated
on demand and filled only for accounted allocations.

It allows us to remove a bunch of code which is required to handle
kmem_cache clones for accounted allocations. There is no more need
to create them, accumulate statistics, propagate attributes, etc.
It's quite a significant simplification.

Also, because the total number of slab_caches is almost halved
(not all kmem_caches have a memcg clone), some additional memory
savings are expected. On my devvm it additionally saves about 3.5%
of slab memory.

Link: http://lkml.kernel.org/r/20200608230654.828134-18-guro@fb.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h     |    2 
 include/linux/slab_def.h |    3 
 include/linux/slub_def.h |   10 -
 mm/memcontrol.c          |    5 
 mm/slab.c                |   46 -------
 mm/slab.h                |  176 ++++++----------------------
 mm/slab_common.c         |  230 -------------------------------------
 mm/slub.c                |  166 --------------------------
 8 files changed, 58 insertions(+), 580 deletions(-)

--- a/include/linux/slab_def.h~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/include/linux/slab_def.h
@@ -72,9 +72,6 @@ struct kmem_cache {
 	int obj_offset;
 #endif /* CONFIG_DEBUG_SLAB */
 
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-#endif
 #ifdef CONFIG_KASAN
 	struct kasan_cache kasan_info;
 #endif
--- a/include/linux/slab.h~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/include/linux/slab.h
@@ -155,8 +155,6 @@ struct kmem_cache *kmem_cache_create_use
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct kmem_cache *cachep);
-
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
--- a/include/linux/slub_def.h~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/include/linux/slub_def.h
@@ -108,17 +108,7 @@ struct kmem_cache {
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
-	struct work_struct kobj_remove_work;
 #endif
-#ifdef CONFIG_MEMCG
-	struct memcg_cache_params memcg_params;
-	/* For propagation, maximum size of a stored attr */
-	unsigned int max_attr_size;
-#ifdef CONFIG_SYSFS
-	struct kset *memcg_kset;
-#endif
-#endif
-
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	unsigned long random;
 #endif
--- a/mm/memcontrol.c~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/mm/memcontrol.c
@@ -2826,7 +2826,10 @@ struct mem_cgroup *mem_cgroup_from_obj(v
 
 		off = obj_to_index(page->slab_cache, page, p);
 		objcg = page_obj_cgroups(page)[off];
-		return obj_cgroup_memcg(objcg);
+		if (objcg)
+			return obj_cgroup_memcg(objcg);
+
+		return NULL;
 	}
 
 	/* All other pages use page->mem_cgroup */
--- a/mm/slab.c~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/mm/slab.c
@@ -1369,11 +1369,7 @@ static struct page *kmem_getpages(struct
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
-		__free_pages(page, cachep->gfporder);
-		return NULL;
-	}
-
+	charge_slab_page(page, flags, cachep->gfporder, cachep);
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -3670,10 +3666,7 @@ EXPORT_SYMBOL(__kmalloc_track_caller);
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 {
-	if (memcg_kmem_enabled())
-		return virt_to_cache(x);
-	else
-		return s;
+	return virt_to_cache(x);
 }
 
 /**
@@ -3800,8 +3793,8 @@ fail:
 }
 
 /* Always called with the slab_mutex held */
-static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
-				int batchcount, int shared, gfp_t gfp)
+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+			    int batchcount, int shared, gfp_t gfp)
 {
 	struct array_cache __percpu *cpu_cache, *prev;
 	int cpu;
@@ -3846,30 +3839,6 @@ setup_node:
 	return setup_kmem_cache_nodes(cachep, gfp);
 }
 
-static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
-				int batchcount, int shared, gfp_t gfp)
-{
-	int ret;
-	struct kmem_cache *c;
-
-	ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
-	if (slab_state < FULL)
-		return ret;
-
-	if ((ret < 0) || !is_root_cache(cachep))
-		return ret;
-
-	lockdep_assert_held(&slab_mutex);
-	c = memcg_cache(cachep);
-	if (c) {
-		/* return value determined by the root cache only */
-		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
-	}
-
-	return ret;
-}
-
 /* Called with slab_mutex held always */
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp)
 {
@@ -3882,13 +3851,6 @@ static int enable_cpucache(struct kmem_c
 	if (err)
 		goto end;
 
-	if (!is_root_cache(cachep)) {
-		struct kmem_cache *root = memcg_root_cache(cachep);
-		limit = root->limit;
-		shared = root->shared;
-		batchcount = root->batchcount;
-	}
-
 	if (limit && shared && batchcount)
 		goto skip_setup;
 	/*
--- a/mm/slab_common.c~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/mm/slab_common.c
@@ -128,36 +128,6 @@ int __kmem_cache_alloc_bulk(struct kmem_
 	return i;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-static void memcg_kmem_cache_create_func(struct work_struct *work)
-{
-	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
-						 memcg_params.work);
-	memcg_create_kmem_cache(cachep);
-}
-
-void slab_init_memcg_params(struct kmem_cache *s)
-{
-	s->memcg_params.root_cache = NULL;
-	s->memcg_params.memcg_cache = NULL;
-	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
-}
-
-static void init_memcg_params(struct kmem_cache *s,
-			      struct kmem_cache *root_cache)
-{
-	if (root_cache)
-		s->memcg_params.root_cache = root_cache;
-	else
-		slab_init_memcg_params(s);
-}
-#else
-static inline void init_memcg_params(struct kmem_cache *s,
-				     struct kmem_cache *root_cache)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 /*
  * Figure out what the alignment of the objects will be given a set of
  * flags, a user specified alignment and the size of the objects.
@@ -195,9 +165,6 @@ int slab_unmergeable(struct kmem_cache *
 	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
 		return 1;
 
-	if (!is_root_cache(s))
-		return 1;
-
 	if (s->ctor)
 		return 1;
 
@@ -284,7 +251,6 @@ static struct kmem_cache *create_cache(c
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
@@ -342,7 +308,6 @@ kmem_cache_create_usercopy(const char *n
 
 	get_online_cpus();
 	get_online_mems();
-	memcg_get_cache_ids();
 
 	mutex_lock(&slab_mutex);
 
@@ -392,7 +357,6 @@ kmem_cache_create_usercopy(const char *n
 out_unlock:
 	mutex_unlock(&slab_mutex);
 
-	memcg_put_cache_ids();
 	put_online_mems();
 	put_online_cpus();
 
@@ -505,87 +469,6 @@ static int shutdown_cache(struct kmem_ca
 	return 0;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
-/*
- * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
- * @root_cache: The parent of the new cache.
- *
- * This function attempts to create a kmem cache that will serve allocation
- * requests going all non-root memory cgroups to @root_cache. The new cache
- * inherits properties from its parent.
- */
-void memcg_create_kmem_cache(struct kmem_cache *root_cache)
-{
-	struct kmem_cache *s = NULL;
-	char *cache_name;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-
-	if (root_cache->memcg_params.memcg_cache)
-		goto out_unlock;
-
-	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
-	if (!cache_name)
-		goto out_unlock;
-
-	s = create_cache(cache_name, root_cache->object_size,
-			 root_cache->align,
-			 root_cache->flags & CACHE_CREATE_MASK,
-			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, root_cache);
-	/*
-	 * If we could not create a memcg cache, do not complain, because
-	 * that's not critical at all as we can always proceed with the root
-	 * cache.
-	 */
-	if (IS_ERR(s)) {
-		kfree(cache_name);
-		goto out_unlock;
-	}
-
-	/*
-	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
-	 * barrier here to ensure nobody will see the kmem_cache partially
-	 * initialized.
-	 */
-	smp_wmb();
-	root_cache->memcg_params.memcg_cache = s;
-
-out_unlock:
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	BUG_ON(!is_root_cache(s));
-
-	if (s->memcg_params.memcg_cache)
-		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
-
-	return 0;
-}
-
-static void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-	cancel_work_sync(&s->memcg_params.work);
-}
-#else
-static inline int shutdown_memcg_caches(struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
-{
-}
-#endif /* CONFIG_MEMCG_KMEM */
-
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
@@ -600,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cach
 	if (unlikely(!s))
 		return;
 
-	cancel_memcg_cache_creation(s);
-
 	get_online_cpus();
 	get_online_mems();
 
@@ -611,10 +492,7 @@ void kmem_cache_destroy(struct kmem_cach
 	if (s->refcount)
 		goto out_unlock;
 
-	err = shutdown_memcg_caches(s);
-	if (!err)
-		err = shutdown_cache(s);
-
+	err = shutdown_cache(s);
 	if (err) {
 		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
 		       s->name);
@@ -651,33 +529,6 @@ int kmem_cache_shrink(struct kmem_cache
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
-/**
- * kmem_cache_shrink_all - shrink root and memcg caches
- * @s: The cache pointer
- */
-void kmem_cache_shrink_all(struct kmem_cache *s)
-{
-	struct kmem_cache *c;
-
-	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
-		kmem_cache_shrink(s);
-		return;
-	}
-
-	get_online_cpus();
-	get_online_mems();
-	kasan_cache_shrink(s);
-	__kmem_cache_shrink(s);
-
-	c = memcg_cache(s);
-	if (c) {
-		kasan_cache_shrink(c);
-		__kmem_cache_shrink(c);
-	}
-	put_online_mems();
-	put_online_cpus();
-}
-
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
@@ -706,8 +557,6 @@ void __init create_boot_cache(struct kme
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	slab_init_memcg_params(s);
-
 	err = __kmem_cache_create(s, flags);
 
 	if (err)
@@ -1081,25 +930,6 @@ void slab_stop(struct seq_file *m, void
 	mutex_unlock(&slab_mutex);
 }
 
-static void
-memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
-{
-	struct kmem_cache *c;
-	struct slabinfo sinfo;
-
-	c = memcg_cache(s);
-	if (c) {
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-
-		info->active_slabs += sinfo.active_slabs;
-		info->num_slabs += sinfo.num_slabs;
-		info->shared_avail += sinfo.shared_avail;
-		info->active_objs += sinfo.active_objs;
-		info->num_objs += sinfo.num_objs;
-	}
-}
-
 static void cache_show(struct kmem_cache *s, struct seq_file *m)
 {
 	struct slabinfo sinfo;
@@ -1107,10 +937,8 @@ static void cache_show(struct kmem_cache
 	memset(&sinfo, 0, sizeof(sinfo));
 	get_slabinfo(s, &sinfo);
 
-	memcg_accumulate_slabinfo(s, &sinfo);
-
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
-		   cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
+		   s->name, sinfo.active_objs, sinfo.num_objs, s->size,
 		   sinfo.objects_per_slab, (1 << sinfo.cache_order));
 
 	seq_printf(m, " : tunables %4u %4u %4u",
@@ -1127,8 +955,7 @@ static int slab_show(struct seq_file *m,
 
 	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	if (is_root_cache(s))
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 
@@ -1153,13 +980,13 @@ void dump_unreclaimable_slab(void)
 	pr_info("Name                      Used          Total\n");
 
 	list_for_each_entry_safe(s, s2, &slab_caches, list) {
-		if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
+		if (s->flags & SLAB_RECLAIM_ACCOUNT)
 			continue;
 
 		get_slabinfo(s, &sinfo);
 
 		if (sinfo.num_objs > 0)
-			pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
+			pr_info("%-17s %10luKB %10luKB\n", s->name,
 				(sinfo.active_objs * s->size) / 1024,
 				(sinfo.num_objs * s->size) / 1024);
 	}
@@ -1218,53 +1045,6 @@ static int __init slab_proc_init(void)
 }
 module_init(slab_proc_init);
 
-#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
-/*
- * Display information about kmem caches that have memcg cache.
- */
-static int memcg_slabinfo_show(struct seq_file *m, void *unused)
-{
-	struct kmem_cache *s, *c;
-	struct slabinfo sinfo;
-
-	mutex_lock(&slab_mutex);
-	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
-	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_caches, list) {
-		/*
-		 * Skip kmem caches that don't have the memcg cache.
-		 */
-		if (!s->memcg_params.memcg_cache)
-			continue;
-
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(s, &sinfo);
-		seq_printf(m, "%-17s root       %6lu %6lu %6lu %6lu\n",
-			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-
-		c = s->memcg_params.memcg_cache;
-		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
-		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
-			   cache_name(c), root_mem_cgroup->css.id,
-			   sinfo.active_objs, sinfo.num_objs,
-			   sinfo.active_slabs, sinfo.num_slabs);
-	}
-	mutex_unlock(&slab_mutex);
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo);
-
-static int __init memcg_slabinfo_init(void)
-{
-	debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO,
-			    NULL, NULL, &memcg_slabinfo_fops);
-	return 0;
-}
-
-late_initcall(memcg_slabinfo_init);
-#endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */
 #endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */
 
 static __always_inline void *__do_krealloc(const void *p, size_t new_size,
--- a/mm/slab.h~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/mm/slab.h
@@ -30,28 +30,6 @@ struct kmem_cache {
 	struct list_head list;	/* List of all slab caches on the system */
 };
 
-#else /* !CONFIG_SLOB */
-
-/*
- * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child cache will have it. Some fields are used
- * in both cases, other are specific to root caches.
- *
- * @root_cache:	Common to root and child caches.  NULL for root, pointer to
- *		the root cache for children.
- *
- * The following fields are specific to root caches.
- *
- * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
- *		cgroups.
- * @work: work struct used to create the non-root cache.
- */
-struct memcg_cache_params {
-	struct kmem_cache *root_cache;
-
-	struct kmem_cache *memcg_cache;
-	struct work_struct work;
-};
 #endif /* CONFIG_SLOB */
 
 #ifdef CONFIG_SLAB
@@ -194,7 +172,6 @@ int __kmem_cache_shutdown(struct kmem_ca
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
-void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
@@ -233,43 +210,6 @@ static inline int cache_vmstat_idx(struc
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return !s->memcg_params.root_cache;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return p == s || p == s->memcg_params.root_cache;
-}
-
-/*
- * We use suffixes to the name in memcg because we can't have caches
- * created in the system with the same name. But when we print them
- * locally, better refer to them with the base name
- */
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	if (!is_root_cache(s))
-		s = s->memcg_params.root_cache;
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s;
-	return s->memcg_params.root_cache;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		return s->memcg_params.memcg_cache;
-	return NULL;
-}
-
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -316,38 +256,25 @@ static inline size_t obj_full_size(struc
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
-	struct kmem_cache *cachep;
 	struct obj_cgroup *objcg;
 
 	if (memcg_kmem_bypass())
-		return s;
-
-	cachep = READ_ONCE(s->memcg_params.memcg_cache);
-	if (unlikely(!cachep)) {
-		/*
-		 * If memcg cache does not exist yet, we schedule it's
-		 * asynchronous creation and let the current allocation
-		 * go through with the root cache.
-		 */
-		queue_work(system_wq, &s->memcg_params.work);
-		return s;
-	}
+		return NULL;
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
-		return s;
+		return NULL;
 
 	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
 		obj_cgroup_put(objcg);
-		cachep = NULL;
+		return NULL;
 	}
 
-	*objcgp = objcg;
-	return cachep;
+	return objcg;
 }
 
 static inline void mod_objcg_state(struct obj_cgroup *objcg,
@@ -366,15 +293,27 @@ static inline void mod_objcg_state(struc
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 	struct page *page;
 	unsigned long off;
 	size_t i;
 
+	if (!objcg)
+		return;
+
+	flags &= ~__GFP_ACCOUNT;
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
+
+			if (!page_has_obj_cgroups(page) &&
+			    memcg_alloc_page_obj_cgroups(page, s, flags)) {
+				obj_cgroup_uncharge(objcg, obj_full_size(s));
+				continue;
+			}
+
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
@@ -393,13 +332,19 @@ static inline void memcg_slab_free_hook(
 	struct obj_cgroup *objcg;
 	unsigned int off;
 
-	if (!memcg_kmem_enabled() || is_root_cache(s))
+	if (!memcg_kmem_enabled())
+		return;
+
+	if (!page_has_obj_cgroups(page))
 		return;
 
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
 
+	if (!objcg)
+		return;
+
 	obj_cgroup_uncharge(objcg, obj_full_size(s));
 	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
 			-obj_full_size(s));
@@ -407,35 +352,7 @@ static inline void memcg_slab_free_hook(
 	obj_cgroup_put(objcg);
 }
 
-extern void slab_init_memcg_params(struct kmem_cache *);
-
 #else /* CONFIG_MEMCG_KMEM */
-static inline bool is_root_cache(struct kmem_cache *s)
-{
-	return true;
-}
-
-static inline bool slab_equal_or_root(struct kmem_cache *s,
-				      struct kmem_cache *p)
-{
-	return s == p;
-}
-
-static inline const char *cache_name(struct kmem_cache *s)
-{
-	return s->name;
-}
-
-static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
-{
-	return s;
-}
-
-static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
-{
-	return NULL;
-}
-
 static inline bool page_has_obj_cgroups(struct page *page)
 {
 	return false;
@@ -456,16 +373,17 @@ static inline void memcg_free_page_obj_c
 {
 }
 
-static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct obj_cgroup **objcgp,
-						size_t objects, gfp_t flags)
+static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+							   size_t objects,
+							   gfp_t flags)
 {
 	return NULL;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
-					      size_t size, void **p)
+					      gfp_t flags, size_t size,
+					      void **p)
 {
 }
 
@@ -473,11 +391,6 @@ static inline void memcg_slab_free_hook(
 					void *p)
 {
 }
-
-static inline void slab_init_memcg_params(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
@@ -491,27 +404,18 @@ static inline struct kmem_cache *virt_to
 	return page->slab_cache;
 }
 
-static __always_inline int charge_slab_page(struct page *page,
-					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
-{
-	if (memcg_kmem_enabled() && !is_root_cache(s)) {
-		int ret;
-
-		ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
-		if (ret)
-			return ret;
-	}
-
+static __always_inline void charge_slab_page(struct page *page,
+					     gfp_t gfp, int order,
+					     struct kmem_cache *s)
+{
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
-	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (memcg_kmem_enabled() && !is_root_cache(s))
+	if (memcg_kmem_enabled())
 		memcg_free_page_obj_cgroups(page);
 
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
@@ -564,7 +468,7 @@ static inline struct kmem_cache *slab_pr
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
+		*objcgp = memcg_slab_pre_alloc_hook(s, size, flags);
 
 	return s;
 }
@@ -583,8 +487,8 @@ static inline void slab_post_alloc_hook(
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled() && !is_root_cache(s))
-		memcg_slab_post_alloc_hook(s, objcg, size, p);
+	if (memcg_kmem_enabled())
+		memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
 }
 
 #ifndef CONFIG_SLOB
--- a/mm/slub.c~mm-memcg-slab-use-a-single-set-of-kmem_caches-for-all-allocations
+++ a/mm/slub.c
@@ -232,14 +232,10 @@ enum track_item { TRACK_ALLOC, TRACK_FRE
 #ifdef CONFIG_SYSFS
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
-static void memcg_propagate_slab_attrs(struct kmem_cache *s);
-static void sysfs_slab_remove(struct kmem_cache *s);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
-static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
-static inline void sysfs_slab_remove(struct kmem_cache *s) { }
 #endif
 
 static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -1643,10 +1639,8 @@ static inline struct page *alloc_slab_pa
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
-		__free_pages(page, order);
-		page = NULL;
-	}
+	if (page)
+		charge_slab_page(page, flags, order, s);
 
 	return page;
 }
@@ -3185,12 +3179,11 @@ static inline struct kmem_cache *cache_f
 	struct kmem_cache *cachep;
 
 	if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
-	    !memcg_kmem_enabled() &&
 	    !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS))
 		return s;
 
 	cachep = virt_to_cache(x);
-	if (WARN(cachep && !slab_equal_or_root(cachep, s),
+	if (WARN(cachep && cachep != s,
 		  "%s: Wrong slab cache. %s but object is from %s\n",
 		  __func__, s->name, cachep->name))
 		print_tracking(cachep, x);
@@ -3972,7 +3965,6 @@ int __kmem_cache_shutdown(struct kmem_ca
 		if (n->nr_partial || slabs_node(s, node))
 			return 1;
 	}
-	sysfs_slab_remove(s);
 	return 0;
 }
 
@@ -4410,7 +4402,6 @@ static struct kmem_cache * __init bootst
 			p->slab_cache = s;
 #endif
 	}
-	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
 	return s;
 }
@@ -4466,7 +4457,7 @@ struct kmem_cache *
 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		   slab_flags_t flags, void (*ctor)(void *))
 {
-	struct kmem_cache *s, *c;
+	struct kmem_cache *s;
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
@@ -4479,12 +4470,6 @@ __kmem_cache_alias(const char *name, uns
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		c = memcg_cache(s);
-		if (c) {
-			c->object_size = s->object_size;
-			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
-		}
-
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
 			s = NULL;
@@ -4506,7 +4491,6 @@ int __kmem_cache_create(struct kmem_cach
 	if (slab_state <= UP)
 		return 0;
 
-	memcg_propagate_slab_attrs(s);
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
@@ -5364,7 +5348,7 @@ static ssize_t shrink_store(struct kmem_
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink_all(s);
+		kmem_cache_shrink(s);
 	else
 		return -EINVAL;
 	return length;
@@ -5588,99 +5572,9 @@ static ssize_t slab_attr_store(struct ko
 		return -EIO;
 
 	err = attribute->store(s, buf, len);
-#ifdef CONFIG_MEMCG
-	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		struct kmem_cache *c;
-
-		mutex_lock(&slab_mutex);
-		if (s->max_attr_size < len)
-			s->max_attr_size = len;
-
-		/*
-		 * This is a best effort propagation, so this function's return
-		 * value will be determined by the parent cache only. This is
-		 * basically because not all attributes will have a well
-		 * defined semantics for rollbacks - most of the actions will
-		 * have permanent effects.
-		 *
-		 * Returning the error value of any of the children that fail
-		 * is not 100 % defined, in the sense that users seeing the
-		 * error code won't be able to know anything about the state of
-		 * the cache.
-		 *
-		 * Only returning the error code for the parent cache at least
-		 * has well defined semantics. The cache being written to
-		 * directly either failed or succeeded, in which case we loop
-		 * through the descendants with best-effort propagation.
-		 */
-		c = memcg_cache(s);
-		if (c)
-			attribute->store(c, buf, len);
-		mutex_unlock(&slab_mutex);
-	}
-#endif
 	return err;
 }
 
-static void memcg_propagate_slab_attrs(struct kmem_cache *s)
-{
-#ifdef CONFIG_MEMCG
-	int i;
-	char *buffer = NULL;
-	struct kmem_cache *root_cache;
-
-	if (is_root_cache(s))
-		return;
-
-	root_cache = s->memcg_params.root_cache;
-
-	/*
-	 * This mean this cache had no attribute written. Therefore, no point
-	 * in copying default values around
-	 */
-	if (!root_cache->max_attr_size)
-		return;
-
-	for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) {
-		char mbuf[64];
-		char *buf;
-		struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
-		ssize_t len;
-
-		if (!attr || !attr->store || !attr->show)
-			continue;
-
-		/*
-		 * It is really bad that we have to allocate here, so we will
-		 * do it only as a fallback. If we actually allocate, though,
-		 * we can just use the allocated buffer until the end.
-		 *
-		 * Most of the slub attributes will tend to be very small in
-		 * size, but sysfs allows buffers up to a page, so they can
-		 * theoretically happen.
-		 */
-		if (buffer)
-			buf = buffer;
-		else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) &&
-			 !IS_ENABLED(CONFIG_SLUB_STATS))
-			buf = mbuf;
-		else {
-			buffer = (char *) get_zeroed_page(GFP_KERNEL);
-			if (WARN_ON(!buffer))
-				continue;
-			buf = buffer;
-		}
-
-		len = attr->show(root_cache, buf);
-		if (len > 0)
-			attr->store(s, buf, len);
-	}
-
-	if (buffer)
-		free_page((unsigned long)buffer);
-#endif	/* CONFIG_MEMCG */
-}
-
 static void kmem_cache_release(struct kobject *k)
 {
 	slab_kmem_cache_release(to_slab(k));
@@ -5700,10 +5594,6 @@ static struct kset *slab_kset;
 
 static inline struct kset *cache_kset(struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG
-	if (!is_root_cache(s))
-		return s->memcg_params.root_cache->memcg_kset;
-#endif
 	return slab_kset;
 }
 
@@ -5746,27 +5636,6 @@ static char *create_unique_id(struct kme
 	return name;
 }
 
-static void sysfs_slab_remove_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s =
-		container_of(work, struct kmem_cache, kobj_remove_work);
-
-	if (!s->kobj.state_in_sysfs)
-		/*
-		 * For a memcg cache, this may be called during
-		 * deactivation and again on shutdown.  Remove only once.
-		 * A cache is never shut down before deactivation is
-		 * complete, so no need to worry about synchronization.
-		 */
-		goto out;
-
-#ifdef CONFIG_MEMCG
-	kset_unregister(s->memcg_kset);
-#endif
-out:
-	kobject_put(&s->kobj);
-}
-
 static int sysfs_slab_add(struct kmem_cache *s)
 {
 	int err;
@@ -5774,8 +5643,6 @@ static int sysfs_slab_add(struct kmem_ca
 	struct kset *kset = cache_kset(s);
 	int unmergeable = slab_unmergeable(s);
 
-	INIT_WORK(&s->kobj_remove_work, sysfs_slab_remove_workfn);
-
 	if (!kset) {
 		kobject_init(&s->kobj, &slab_ktype);
 		return 0;
@@ -5812,16 +5679,6 @@ static int sysfs_slab_add(struct kmem_ca
 	if (err)
 		goto out_del_kobj;
 
-#ifdef CONFIG_MEMCG
-	if (is_root_cache(s) && memcg_sysfs_enabled) {
-		s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);
-		if (!s->memcg_kset) {
-			err = -ENOMEM;
-			goto out_del_kobj;
-		}
-	}
-#endif
-
 	if (!unmergeable) {
 		/* Setup first alias */
 		sysfs_slab_alias(s, s->name);
@@ -5835,19 +5692,6 @@ out_del_kobj:
 	goto out;
 }
 
-static void sysfs_slab_remove(struct kmem_cache *s)
-{
-	if (slab_state < FULL)
-		/*
-		 * Sysfs has not been setup yet so no need to remove the
-		 * cache from sysfs.
-		 */
-		return;
-
-	kobject_get(&s->kobj);
-	schedule_work(&s->kobj_remove_work);
-}
-
 void sysfs_slab_unlink(struct kmem_cache *s)
 {
 	if (slab_state >= FULL)
_



* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-17 23:35   ` Andrew Morton
@ 2020-06-18  0:35     ` Roman Gushchin
  2020-06-18  7:33       ` Vlastimil Babka
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18  0:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Johannes Weiner, Michal Hocko, Shakeel Butt,
	linux-mm, Vlastimil Babka, kernel-team, linux-kernel

On Wed, Jun 17, 2020 at 04:35:28PM -0700, Andrew Morton wrote:
> On Mon, 8 Jun 2020 16:06:52 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> > Instead of having two sets of kmem_caches: one for system-wide and
> > non-accounted allocations and the second one shared by all accounted
> > allocations, we can use just one.
> > 
> > The idea is simple: space for obj_cgroup metadata can be allocated
> > on demand and filled only for accounted allocations.
> > 
> > It allows to remove a bunch of code which is required to handle
> > kmem_cache clones for accounted allocations. There is no more need
> > to create them, accumulate statistics, propagate attributes, etc.
> > It's a quite significant simplification.
> > 
> > Also, because the total number of slab_caches is reduced almost twice
> > (not all kmem_caches have a memcg clone), some additional memory
> > savings are expected. On my devvm it additionally saves about 3.5%
> > of slab memory.
> > 
> 
> This ran afoul of Vlastimil's "mm, slab/slub: move and improve
> cache_from_obj()"
> (http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz).  I
> resolved things as below.  Not too sure about slab.c's
> cache_from_obj()...

It can actually be as simple as:
static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
	return s;
}

But I wonder if we need it at all, or maybe we wanna rename it to
something like obj_check_kmem_cache(void *obj, struct kmem_cache *s),
because it now serves only debugging purposes.
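
To make the "debug only" idea concrete, here is a rough user-space sketch
of what such a check could look like. It is not the follow-up patch: the
struct definitions are minimal stand-ins, and the function takes the
backing page explicitly instead of deriving it from the object pointer,
which is an assumption made only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the kernel types; only the fields used here. */
struct kmem_cache {
	const char *name;
};

struct page {
	struct kmem_cache *slab_cache;
};

/*
 * Hypothetical debug-only check. Returns true if the page backing the
 * object belongs to cache s (or has no recorded cache), and prints a
 * warning and returns false otherwise. In the kernel the page would be
 * looked up from the object address; here the caller passes it in.
 */
bool obj_check_kmem_cache(const struct page *page, const struct kmem_cache *s)
{
	const struct kmem_cache *cachep = page->slab_cache;

	if (cachep && cachep != s) {
		fprintf(stderr, "Wrong slab cache. %s but object is from %s\n",
			s->name, cachep->name);
		return false;
	}
	return true;
}
```

With a single set of caches there is no root/child relationship left to
resolve, so the check reduces to a plain pointer comparison.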

Let me and Vlastimil figure it out and send a follow-up patch.
Your version is definitely correct.

Thanks!


* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
@ 2020-06-18  0:47   ` Shakeel Butt
  2020-06-18 14:55   ` Shakeel Butt
  2020-06-19  1:31   ` Shakeel Butt
  2 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-18  0:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> The reference counting of a memcg is currently coupled directly to how
> many 4k pages are charged to it. This doesn't work well with Roman's
> new slab controller, which maintains pools of objects and doesn't want
> to keep an extra balance sheet for the pages backing those objects.
>
> This unusual refcounting design (reference counts usually track
> pointers to an object) is only for historical reasons: memcg used to
> not take any css references and simply stalled offlining until all
> charges had been reparented and the page counters had dropped to
> zero. When we got rid of the reparenting requirement, the simple
> mechanical translation was to take a reference for every charge.
>
> More historical context can be found in commit e8ea14cc6ead ("mm:
> memcontrol: take a css reference for each charged page"),
> commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
> tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
> reclaim from offlined groups").
>
> The new slab controller exposes the limitations in this scheme, so
> let's switch it to a more idiomatic reference counting model based on
> actual kernel pointers to the memcg:
>
> - The per-cpu stock holds a reference to the memcg it's caching
>
> - User pages hold a reference for their page->mem_cgroup. Transparent
>   huge pages will no longer acquire tail references in advance, we'll
>   get them if needed during the split.
>
> - Kernel pages hold a reference for their page->mem_cgroup
>
> - Pages allocated in the root cgroup will acquire and release css
>   references for simplicity. css_get() and css_put() optimize that.
>
> - The current memcg_charge_slab() already hacked around the per-charge
>   references; this change gets rid of that as well.
>
> Roman:
> 1) Rebased on top of the current mm tree: added css_get() in
>    mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
> 2) I've reformatted commit references in the commit log to make
>    checkpatch.pl happy.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 37 +++++++++++++++++++++----------------
>  mm/slab.h       |  2 --
>  2 files changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d18bf93e0f19..80282b2e8b7f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2094,13 +2094,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  {
>         struct mem_cgroup *old = stock->cached;
>
> +       if (!old)
> +               return;
> +
>         if (stock->nr_pages) {
>                 page_counter_uncharge(&old->memory, stock->nr_pages);
>                 if (do_memsw_account())
>                         page_counter_uncharge(&old->memsw, stock->nr_pages);
> -               css_put_many(&old->css, stock->nr_pages);
>                 stock->nr_pages = 0;
>         }
> +
> +       css_put(&old->css);
>         stock->cached = NULL;
>  }
>
> @@ -2136,6 +2140,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>         stock = this_cpu_ptr(&memcg_stock);
>         if (stock->cached != memcg) { /* reset if necessary */
>                 drain_stock(stock);
> +               css_get(&memcg->css);
>                 stock->cached = memcg;
>         }
>         stock->nr_pages += nr_pages;
> @@ -2594,12 +2599,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         page_counter_charge(&memcg->memory, nr_pages);
>         if (do_memsw_account())
>                 page_counter_charge(&memcg->memsw, nr_pages);
> -       css_get_many(&memcg->css, nr_pages);
>
>         return 0;
>
>  done_restock:
> -       css_get_many(&memcg->css, batch);
>         if (batch > nr_pages)
>                 refill_stock(memcg, batch - nr_pages);
>
> @@ -2657,8 +2660,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>         page_counter_uncharge(&memcg->memory, nr_pages);
>         if (do_memsw_account())
>                 page_counter_uncharge(&memcg->memsw, nr_pages);
> -
> -       css_put_many(&memcg->css, nr_pages);
>  }
>  #endif
>
> @@ -2964,6 +2965,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>                 if (!ret) {
>                         page->mem_cgroup = memcg;
>                         __SetPageKmemcg(page);
> +                       return 0;
>                 }
>         }
>         css_put(&memcg->css);
> @@ -2986,12 +2988,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>         VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
>         __memcg_kmem_uncharge(memcg, nr_pages);
>         page->mem_cgroup = NULL;
> +       css_put(&memcg->css);
>
>         /* slab pages do not have PageKmemcg flag set */
>         if (PageKmemcg(page))
>                 __ClearPageKmemcg(page);
> -
> -       css_put_many(&memcg->css, nr_pages);
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
>
> @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> +       struct mem_cgroup *memcg = head->mem_cgroup;
>         int i;
>
>         if (mem_cgroup_disabled())

if (mem_cgroup_disabled() || !memcg)?

>                 return;
>
> -       for (i = 1; i < HPAGE_PMD_NR; i++)
> -               head[i].mem_cgroup = head->mem_cgroup;
> +       for (i = 1; i < HPAGE_PMD_NR; i++) {
> +               css_get(&memcg->css);
> +               head[i].mem_cgroup = memcg;
> +       }
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> @@ -5454,7 +5458,10 @@ static int mem_cgroup_move_account(struct page *page,
>          */
>         smp_mb();
>
> -       page->mem_cgroup = to;  /* caller should have done css_get */
> +       css_get(&to->css);
> +       css_put(&from->css);
> +
> +       page->mem_cgroup = to;
>
>         __unlock_page_memcg(from);
>
> @@ -6540,6 +6547,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>         if (ret)
>                 goto out_put;
>
> +       css_get(&memcg->css);
>         commit_charge(page, memcg);
>
>         local_irq_disable();
> @@ -6594,9 +6602,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>         __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
>         memcg_check_events(ug->memcg, ug->dummy_page);
>         local_irq_restore(flags);
> -
> -       if (!mem_cgroup_is_root(ug->memcg))
> -               css_put_many(&ug->memcg->css, ug->nr_pages);
>  }
>
>  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> @@ -6634,6 +6639,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>
>         ug->dummy_page = page;
>         page->mem_cgroup = NULL;
> +       css_put(&ug->memcg->css);
>  }
>
>  static void uncharge_list(struct list_head *page_list)
> @@ -6739,8 +6745,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
>         page_counter_charge(&memcg->memory, nr_pages);
>         if (do_memsw_account())
>                 page_counter_charge(&memcg->memsw, nr_pages);
> -       css_get_many(&memcg->css, nr_pages);
>
> +       css_get(&memcg->css);
>         commit_charge(newpage, memcg);
>
>         local_irq_save(flags);
> @@ -6977,8 +6983,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>         mem_cgroup_charge_statistics(memcg, page, -nr_entries);
>         memcg_check_events(memcg, page);
>
> -       if (!mem_cgroup_is_root(memcg))
> -               css_put_many(&memcg->css, nr_entries);
> +       css_put(&memcg->css);
>  }
>
>  /**
> diff --git a/mm/slab.h b/mm/slab.h
> index 633eedb6bad1..8a574d9361c1 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -373,9 +373,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
>         lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>         mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
>
> -       /* transer try_charge() page references to kmem_cache */
>         percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
> -       css_put_many(&memcg->css, nr_pages);
>  out:
>         css_put(&memcg->css);
>         return ret;
> --
> 2.25.4
>
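
For readers following along, the stock part of the new model (one
reference per cached pointer, instead of one per charged page) can be
modeled in user space roughly as below. This is a simplified sketch, not
the kernel code: the structs are stand-ins, charged_pages replaces the
page counters, the refcount is a plain long rather than a percpu_ref,
and the stock is passed explicitly instead of being per-cpu.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; field names are chosen for the example only. */
struct css {
	long refcnt;
};

struct mem_cgroup {
	struct css css;
	long charged_pages;	/* models the memory page counter */
};

struct memcg_stock {
	struct mem_cgroup *cached;
	unsigned int nr_pages;
};

void css_get(struct css *css) { css->refcnt++; }
void css_put(struct css *css) { css->refcnt--; }

/* Return the stocked pages and drop the single pointer reference. */
void drain_stock(struct memcg_stock *stock)
{
	struct mem_cgroup *old = stock->cached;

	if (!old)
		return;

	if (stock->nr_pages) {
		old->charged_pages -= (long)stock->nr_pages;
		stock->nr_pages = 0;
	}

	css_put(&old->css);
	stock->cached = NULL;
}

/* Cache pre-charged pages; take a reference only when the pointer changes. */
void refill_stock(struct memcg_stock *stock, struct mem_cgroup *memcg,
		  unsigned int nr_pages)
{
	if (stock->cached != memcg) {
		drain_stock(stock);
		css_get(&memcg->css);
		stock->cached = memcg;
	}
	stock->nr_pages += nr_pages;
}
```

In this model, refilling the same memcg twice leaves its refcount
unchanged, which is the property the per-page scheme did not have.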


* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
  2020-06-17  2:41   ` Roman Gushchin
@ 2020-06-18  1:18   ` Roman Gushchin
  1 sibling, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18  1:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > This is v6 of the slab cgroup controller rework.
> >
> > The patchset moves the accounting from the page level to the object
> > level. It allows to share slab pages between memory cgroups.
> > This leads to a significant win in the slab utilization (up to 45%)
> > and the corresponding drop in the total kernel memory footprint.
> 
> Is this based on just SLUB or does this have a similar impact on SLAB as well?


Just got some fresh numbers on my desktop running 5.8-rc1 + slab controller v6.
It's an 8-core Ryzen 1700 with 32 GB RAM running Fedora 32.

I measured the size of slab memory just after logging into the system.

                   SLUB           SLAB
Original:     463232 kB      312880 kB
Patched:      194840 kB      193392 kB
                   -58%           -38%
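
The percentages in the last row are the relative change from the original
to the patched size. A small sketch of that arithmetic, assuming rounding
to the nearest whole percent (the helper name is made up for this example):

```c
#include <assert.h>

/*
 * Relative change from before_kb to after_kb in whole percent, rounded
 * to the nearest integer. A negative result means memory was saved.
 */
long long slab_change_percent(long long before_kb, long long after_kb)
{
	long long scaled = (after_kb - before_kb) * 100;

	/* Integer division truncates toward zero, so bias by half first. */
	if (scaled >= 0)
		return (scaled + before_kb / 2) / before_kb;
	return (scaled - before_kb / 2) / before_kb;
}
```

For the SLUB column this is (194840 - 463232) * 100 / 463232, about -57.9,
and for SLAB (193392 - 312880) * 100 / 312880, about -38.2.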

Plus percpu memory usage is also a bit lower.

Thanks!


* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17 11:24         ` Vlastimil Babka
  2020-06-17 14:31           ` Mel Gorman
@ 2020-06-18  1:29           ` Roman Gushchin
  2020-06-18  8:43             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18  1:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Shakeel Butt, Andrew Morton, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Linux MM, Kernel Team, LKML, Mel Gorman,
	Jesper Dangaard Brouer

On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> On 6/17/20 5:32 AM, Roman Gushchin wrote:
> > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:
> >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:
> >> >
> >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:
> >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >> > > >
> >> [...]
> >> > >
> >> > > Have you performed any [perf] testing on SLAB with this patchset?
> >> >
> >> > The accounting part is the same for SLAB and SLUB, so there should be no
> >> > significant difference. I've checked that it compiles, boots and passes
> >> > kselftests. And that memory savings are there.
> >> >
> >> 
> >> What about performance? Also you mentioned that sharing kmem-cache
> >> between accounted and non-accounted can have additional overhead. Any
> >> difference between SLAB and SLUB for such a case?
> > 
> > Not really.
> > 
> > Sharing a single set of caches adds some overhead to root- and non-accounted
> > allocations, which is something I've tried hard to avoid in my original version.
> > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > it's hard to argue with Johannes, who pushed on this design.
> > 
> > With performance testing it's not that easy, because it's not obvious what
> > we wanna test. Obviously, per-object accounting is more expensive, and
> > measuring something like 1000000 allocations and deallocations in a line from
> > a single kmem_cache will show a regression. But in the real world the relative
> > cost of allocations is usually low, and we can get some benefits from a smaller
> > working set and from having shared kmem_cache objects cache hot.
> > Not speaking about some extra memory and the fragmentation reduction.
> > 
> > We've done an extensive testing of the original version in Facebook production,
> > and we haven't noticed any regressions so far. But I have to admit, we were
> > using an original version with two sets of kmem_caches.
> > 
> > If you have any specific tests in mind, I can definitely run them. Or if you
> > can help with the performance evaluation, I'll appreciate it a lot.
> 
> Jesper provided some pointers here [1], it would be really great if you could
> run at least those microbenchmarks. With mmtests it's the major question of
> which subset/profiles to run, maybe the referenced commits provide some hints,
> or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> 
> [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/

Oh, Jesper, I'm really sorry, somehow I missed your mail.
Thank you, Vlastimil, for pointing at it.

I've got some results (slab_bulk_test01), but honestly I fail to interpret them.

I ran original vs patched with SLUB and SLAB, each test several times, and picked
the 3 runs which looked most consistent. But it still looks very noisy.

I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).

How should I interpret this data?

--

SLUB:

Patched:
[  444.395174] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.773 ns (step:0) - (measurement period time:0.077335091 sec time_interval:77335091) - (invoke count:100000000 tsc_interval:231555960)
[  445.012669] time_bench: Type:kmem fastpath reuse Per elem: 184 cycles(tsc) 61.747 ns (step:0) - (measurement period time:0.617475365 sec time_interval:617475365) - (invoke count:10000000 tsc_interval:1848850440)
[  445.703843] time_bench: Type:kmem bulk_fallback Per elem: 206 cycles(tsc) 69.115 ns (step:1) - (measurement period time:0.691150675 sec time_interval:691150675) - (invoke count:10000000 tsc_interval:2069450250)
[  446.329396] time_bench: Type:kmem bulk_quick_reuse Per elem: 187 cycles(tsc) 62.554 ns (step:1) - (measurement period time:0.625541838 sec time_interval:625541838) - (invoke count:10000000 tsc_interval:1873003020)
[  446.975616] time_bench: Type:kmem bulk_fallback Per elem: 193 cycles(tsc) 64.622 ns (step:2) - (measurement period time:0.646223732 sec time_interval:646223732) - (invoke count:10000000 tsc_interval:1934929440)
[  447.345512] time_bench: Type:kmem bulk_quick_reuse Per elem: 110 cycles(tsc) 36.988 ns (step:2) - (measurement period time:0.369885352 sec time_interval:369885352) - (invoke count:10000000 tsc_interval:1107514050)
[  447.986272] time_bench: Type:kmem bulk_fallback Per elem: 191 cycles(tsc) 64.075 ns (step:3) - (measurement period time:0.640756304 sec time_interval:640756304) - (invoke count:9999999 tsc_interval:1918559070)
[  448.282163] time_bench: Type:kmem bulk_quick_reuse Per elem: 88 cycles(tsc) 29.586 ns (step:3) - (measurement period time:0.295866328 sec time_interval:295866328) - (invoke count:9999999 tsc_interval:885885270)
[  448.623183] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.100 ns (step:4) - (measurement period time:0.341005290 sec time_interval:341005290) - (invoke count:10000000 tsc_interval:1021040820)
[  448.930228] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.702 ns (step:4) - (measurement period time:0.307020500 sec time_interval:307020500) - (invoke count:10000000 tsc_interval:919282860)
[  449.739697] time_bench: Type:kmem bulk_fallback Per elem: 242 cycles(tsc) 80.946 ns (step:8) - (measurement period time:0.809465825 sec time_interval:809465825) - (invoke count:10000000 tsc_interval:2423710560)
[  449.848110] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.836 ns (step:8) - (measurement period time:0.108363638 sec time_interval:108363638) - (invoke count:10000000 tsc_interval:324462540)
[  450.617892] time_bench: Type:kmem bulk_fallback Per elem: 230 cycles(tsc) 76.978 ns (step:16) - (measurement period time:0.769783892 sec time_interval:769783892) - (invoke count:10000000 tsc_interval:2304894090)
[  450.719556] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.164 ns (step:16) - (measurement period time:0.101645837 sec time_interval:101645837) - (invoke count:10000000 tsc_interval:304348440)
[  451.025387] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.580 ns (step:30) - (measurement period time:0.305803321 sec time_interval:305803321) - (invoke count:9999990 tsc_interval:915639450)
[  451.277708] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.229 ns (step:30) - (measurement period time:0.252294821 sec time_interval:252294821) - (invoke count:9999990 tsc_interval:755422110)
[  451.709305] time_bench: Type:kmem bulk_fallback Per elem: 129 cycles(tsc) 43.158 ns (step:32) - (measurement period time:0.431581619 sec time_interval:431581619) - (invoke count:10000000 tsc_interval:1292245320)
[  451.810686] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.135 ns (step:32) - (measurement period time:0.101357841 sec time_interval:101357841) - (invoke count:10000000 tsc_interval:303485250)
[  452.186138] time_bench: Type:kmem bulk_fallback Per elem: 112 cycles(tsc) 37.545 ns (step:34) - (measurement period time:0.375453243 sec time_interval:375453243) - (invoke count:9999978 tsc_interval:1124185320)
[  452.304950] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.880 ns (step:34) - (measurement period time:0.118800736 sec time_interval:118800736) - (invoke count:9999978 tsc_interval:355713360)
[  452.658607] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.362 ns (step:48) - (measurement period time:0.353623065 sec time_interval:353623065) - (invoke count:9999984 tsc_interval:1058820960)
[  452.891623] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.298 ns (step:48) - (measurement period time:0.232988291 sec time_interval:232988291) - (invoke count:9999984 tsc_interval:697614570)
[  453.237406] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.578 ns (step:64) - (measurement period time:0.345780444 sec time_interval:345780444) - (invoke count:10000000 tsc_interval:1035338790)
[  453.344946] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.750 ns (step:64) - (measurement period time:0.107500964 sec time_interval:107500964) - (invoke count:10000000 tsc_interval:321880290)
[  454.249297] time_bench: Type:kmem bulk_fallback Per elem: 270 cycles(tsc) 90.434 ns (step:128) - (measurement period time:0.904340126 sec time_interval:904340126) - (invoke count:10000000 tsc_interval:2707784610)
[  454.582548] time_bench: Type:kmem bulk_quick_reuse Per elem: 99 cycles(tsc) 33.322 ns (step:128) - (measurement period time:0.333226211 sec time_interval:333226211) - (invoke count:10000000 tsc_interval:997748760)
[  454.965002] time_bench: Type:kmem bulk_fallback Per elem: 114 cycles(tsc) 38.241 ns (step:158) - (measurement period time:0.382415227 sec time_interval:382415227) - (invoke count:9999978 tsc_interval:1145031120)
[  455.314105] time_bench: Type:kmem bulk_quick_reuse Per elem: 104 cycles(tsc) 34.908 ns (step:158) - (measurement period time:0.349080430 sec time_interval:349080430) - (invoke count:9999978 tsc_interval:1045219530)
[  455.699089] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.495 ns (step:250) - (measurement period time:0.384953654 sec time_interval:384953654) - (invoke count:10000000 tsc_interval:1152631920)
[  456.104244] time_bench: Type:kmem bulk_quick_reuse Per elem: 121 cycles(tsc) 40.513 ns (step:250) - (measurement period time:0.405138149 sec time_interval:405138149) - (invoke count:10000000 tsc_interval:1213068180)
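
As a reading aid for the lines above: the "Per elem" figures are
consistent with the interval totals divided by the invoke count, with
cycles truncated to an integer and nanoseconds to three decimals. The
sketch below reproduces that arithmetic from the printed fields. It is
inferred from the numbers in this log, not taken from the time_bench
source, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per element: total TSC ticks over invocations, truncated. */
uint64_t per_elem_cycles(uint64_t tsc_interval, uint64_t invoke_count)
{
	return tsc_interval / invoke_count;
}

/*
 * Time per element in picoseconds (thousandths of a nanosecond),
 * truncated. time_interval_ns is the "time_interval" field of the log.
 */
uint64_t per_elem_ps(uint64_t time_interval_ns, uint64_t invoke_count)
{
	return time_interval_ns * 1000 / invoke_count;
}
```

For the "kmem fastpath reuse" line, 1848850440 / 10000000 gives 184 cycles
and 617475365 / 10000000 gives 61.747 ns, matching the printed values.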

[  465.696654] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077270577 sec time_interval:77270577) - (invoke count:100000000 tsc_interval:231363840)
[  466.290176] time_bench: Type:kmem fastpath reuse Per elem: 177 cycles(tsc) 59.349 ns (step:0) - (measurement period time:0.593496780 sec time_interval:593496780) - (invoke count:10000000 tsc_interval:1777053420)
[  466.629838] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 33.965 ns (step:1) - (measurement period time:0.339652351 sec time_interval:339652351) - (invoke count:10000000 tsc_interval:1016989230)
[  466.933290] time_bench: Type:kmem bulk_quick_reuse Per elem: 90 cycles(tsc) 30.344 ns (step:1) - (measurement period time:0.303444180 sec time_interval:303444180) - (invoke count:10000000 tsc_interval:908575380)
[  467.250189] time_bench: Type:kmem bulk_fallback Per elem: 94 cycles(tsc) 31.689 ns (step:2) - (measurement period time:0.316896073 sec time_interval:316896073) - (invoke count:10000000 tsc_interval:948853110)
[  467.430142] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.994 ns (step:2) - (measurement period time:0.179940800 sec time_interval:179940800) - (invoke count:10000000 tsc_interval:538779390)
[  467.780573] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 35.039 ns (step:3) - (measurement period time:0.350394226 sec time_interval:350394226) - (invoke count:9999999 tsc_interval:1049153580)
[  468.100301] time_bench: Type:kmem bulk_quick_reuse Per elem: 95 cycles(tsc) 31.970 ns (step:3) - (measurement period time:0.319706687 sec time_interval:319706687) - (invoke count:9999999 tsc_interval:957267660)
[  468.792650] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.235 ns (step:4) - (measurement period time:0.692354598 sec time_interval:692354598) - (invoke count:10000000 tsc_interval:2073054750)
[  469.078816] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.614 ns (step:4) - (measurement period time:0.286145162 sec time_interval:286145162) - (invoke count:10000000 tsc_interval:856777710)
[  469.694558] time_bench: Type:kmem bulk_fallback Per elem: 184 cycles(tsc) 61.573 ns (step:8) - (measurement period time:0.615733224 sec time_interval:615733224) - (invoke count:10000000 tsc_interval:1843634190)
[  469.917439] time_bench: Type:kmem bulk_quick_reuse Per elem: 66 cycles(tsc) 22.284 ns (step:8) - (measurement period time:0.222848937 sec time_interval:222848937) - (invoke count:10000000 tsc_interval:667255740)
[  470.586966] time_bench: Type:kmem bulk_fallback Per elem: 200 cycles(tsc) 66.952 ns (step:16) - (measurement period time:0.669526473 sec time_interval:669526473) - (invoke count:10000000 tsc_interval:2004702960)
[  470.794012] time_bench: Type:kmem bulk_quick_reuse Per elem: 61 cycles(tsc) 20.697 ns (step:16) - (measurement period time:0.206972335 sec time_interval:206972335) - (invoke count:10000000 tsc_interval:619717170)
[  471.422674] time_bench: Type:kmem bulk_fallback Per elem: 188 cycles(tsc) 62.866 ns (step:30) - (measurement period time:0.628659634 sec time_interval:628659634) - (invoke count:9999990 tsc_interval:1882338990)
[  471.524193] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.149 ns (step:30) - (measurement period time:0.101497972 sec time_interval:101497972) - (invoke count:9999990 tsc_interval:303905340)
[  471.829474] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.527 ns (step:32) - (measurement period time:0.305271485 sec time_interval:305271485) - (invoke count:10000000 tsc_interval:914046510)
[  471.930490] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.099 ns (step:32) - (measurement period time:0.100992877 sec time_interval:100992877) - (invoke count:10000000 tsc_interval:302392890)
[  472.311211] time_bench: Type:kmem bulk_fallback Per elem: 113 cycles(tsc) 38.072 ns (step:34) - (measurement period time:0.380725777 sec time_interval:380725777) - (invoke count:9999978 tsc_interval:1139972850)
[  472.429823] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.860 ns (step:34) - (measurement period time:0.118599617 sec time_interval:118599617) - (invoke count:9999978 tsc_interval:355111890)
[  472.890092] time_bench: Type:kmem bulk_fallback Per elem: 137 cycles(tsc) 46.026 ns (step:48) - (measurement period time:0.460264730 sec time_interval:460264730) - (invoke count:9999984 tsc_interval:1378127970)
[  472.999481] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.937 ns (step:48) - (measurement period time:0.109371593 sec time_interval:109371593) - (invoke count:9999984 tsc_interval:327480390)
[  473.344109] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.462 ns (step:64) - (measurement period time:0.344629774 sec time_interval:344629774) - (invoke count:10000000 tsc_interval:1031893740)
[  473.452099] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.794 ns (step:64) - (measurement period time:0.107942846 sec time_interval:107942846) - (invoke count:10000000 tsc_interval:323202390)
[  474.382899] time_bench: Type:kmem bulk_fallback Per elem: 278 cycles(tsc) 93.080 ns (step:128) - (measurement period time:0.930809025 sec time_interval:930809025) - (invoke count:10000000 tsc_interval:2787037260)
[  474.729757] time_bench: Type:kmem bulk_quick_reuse Per elem: 103 cycles(tsc) 34.683 ns (step:128) - (measurement period time:0.346831572 sec time_interval:346831572) - (invoke count:10000000 tsc_interval:1038484980)
[  475.616707] time_bench: Type:kmem bulk_fallback Per elem: 265 cycles(tsc) 88.693 ns (step:158) - (measurement period time:0.886937188 sec time_interval:886937188) - (invoke count:9999978 tsc_interval:2655675660)
[  475.890425] time_bench: Type:kmem bulk_quick_reuse Per elem: 81 cycles(tsc) 27.369 ns (step:158) - (measurement period time:0.273692416 sec time_interval:273692416) - (invoke count:9999978 tsc_interval:819491040)
[  476.275144] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.471 ns (step:250) - (measurement period time:0.384713160 sec time_interval:384713160) - (invoke count:10000000 tsc_interval:1151911110)
[  476.424219] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.906 ns (step:250) - (measurement period time:0.149068364 sec time_interval:149068364) - (invoke count:10000000 tsc_interval:446341830)

[  490.306824] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.776 ns (step:0) - (measurement period time:0.077691991 sec time_interval:77691991) - (invoke count:100000000 tsc_interval:232625850)
[  490.897035] time_bench: Type:kmem fastpath reuse Per elem: 176 cycles(tsc) 59.019 ns (step:0) - (measurement period time:0.590195120 sec time_interval:590195120) - (invoke count:10000000 tsc_interval:1767174930)
[  491.590675] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.362 ns (step:1) - (measurement period time:0.693628128 sec time_interval:693628128) - (invoke count:10000000 tsc_interval:2076877050)
[  492.339461] time_bench: Type:kmem bulk_quick_reuse Per elem: 224 cycles(tsc) 74.877 ns (step:1) - (measurement period time:0.748777171 sec time_interval:748777171) - (invoke count:10000000 tsc_interval:2242005540)
[  493.129328] time_bench: Type:kmem bulk_fallback Per elem: 236 cycles(tsc) 78.984 ns (step:2) - (measurement period time:0.789848781 sec time_interval:789848781) - (invoke count:10000000 tsc_interval:2364983220)
[  493.574670] time_bench: Type:kmem bulk_quick_reuse Per elem: 133 cycles(tsc) 44.530 ns (step:2) - (measurement period time:0.445304096 sec time_interval:445304096) - (invoke count:10000000 tsc_interval:1333339110)
[  493.887021] time_bench: Type:kmem bulk_fallback Per elem: 93 cycles(tsc) 31.231 ns (step:3) - (measurement period time:0.312316389 sec time_interval:312316389) - (invoke count:9999999 tsc_interval:935143950)
[  494.029383] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.234 ns (step:3) - (measurement period time:0.142346254 sec time_interval:142346254) - (invoke count:9999999 tsc_interval:426216000)
[  494.369892] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 34.050 ns (step:4) - (measurement period time:0.340504527 sec time_interval:340504527) - (invoke count:10000000 tsc_interval:1019546130)
[  494.493217] time_bench: Type:kmem bulk_quick_reuse Per elem: 36 cycles(tsc) 12.329 ns (step:4) - (measurement period time:0.123294475 sec time_interval:123294475) - (invoke count:10000000 tsc_interval:369169800)
[  494.820003] time_bench: Type:kmem bulk_fallback Per elem: 97 cycles(tsc) 32.678 ns (step:8) - (measurement period time:0.326780876 sec time_interval:326780876) - (invoke count:10000000 tsc_interval:978453960)
[  494.928831] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.880 ns (step:8) - (measurement period time:0.108808086 sec time_interval:108808086) - (invoke count:10000000 tsc_interval:325794570)
[  495.684358] time_bench: Type:kmem bulk_fallback Per elem: 226 cycles(tsc) 75.552 ns (step:16) - (measurement period time:0.755527917 sec time_interval:755527917) - (invoke count:10000000 tsc_interval:2262218520)
[  495.785682] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.130 ns (step:16) - (measurement period time:0.101307607 sec time_interval:101307607) - (invoke count:10000000 tsc_interval:303336720)
[  496.090994] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.528 ns (step:30) - (measurement period time:0.305280433 sec time_interval:305280433) - (invoke count:9999990 tsc_interval:914077290)
[  496.341570] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.054 ns (step:30) - (measurement period time:0.250548825 sec time_interval:250548825) - (invoke count:9999990 tsc_interval:750197910)
[  496.646784] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.518 ns (step:32) - (measurement period time:0.305189218 sec time_interval:305189218) - (invoke count:10000000 tsc_interval:913803540)
[  496.900311] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.349 ns (step:32) - (measurement period time:0.253499465 sec time_interval:253499465) - (invoke count:10000000 tsc_interval:759033060)
[  497.778600] time_bench: Type:kmem bulk_fallback Per elem: 262 cycles(tsc) 87.830 ns (step:34) - (measurement period time:0.878298604 sec time_interval:878298604) - (invoke count:9999978 tsc_interval:2629821090)
[  498.043690] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.506 ns (step:34) - (measurement period time:0.265066374 sec time_interval:265066374) - (invoke count:9999978 tsc_interval:793667400)
[  498.393912] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 35.021 ns (step:48) - (measurement period time:0.350216735 sec time_interval:350216735) - (invoke count:9999984 tsc_interval:1048626840)
[  498.504846] time_bench: Type:kmem bulk_quick_reuse Per elem: 33 cycles(tsc) 11.092 ns (step:48) - (measurement period time:0.110924201 sec time_interval:110924201) - (invoke count:9999984 tsc_interval:332131200)
[  498.878335] time_bench: Type:kmem bulk_fallback Per elem: 111 cycles(tsc) 37.345 ns (step:64) - (measurement period time:0.373454272 sec time_interval:373454272) - (invoke count:10000000 tsc_interval:1118205060)
[  499.145467] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.710 ns (step:64) - (measurement period time:0.267102714 sec time_interval:267102714) - (invoke count:10000000 tsc_interval:799763910)
[  499.525255] time_bench: Type:kmem bulk_fallback Per elem: 113 cycles(tsc) 37.971 ns (step:128) - (measurement period time:0.379715035 sec time_interval:379715035) - (invoke count:10000000 tsc_interval:1136951190)
[  499.852495] time_bench: Type:kmem bulk_quick_reuse Per elem: 97 cycles(tsc) 32.721 ns (step:128) - (measurement period time:0.327218329 sec time_interval:327218329) - (invoke count:10000000 tsc_interval:979763670)
[  500.238889] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.638 ns (step:158) - (measurement period time:0.386388112 sec time_interval:386388112) - (invoke count:9999978 tsc_interval:1156931610)
[  500.370790] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.189 ns (step:158) - (measurement period time:0.131890805 sec time_interval:131890805) - (invoke count:9999978 tsc_interval:394909920)
[  500.747241] time_bench: Type:kmem bulk_fallback Per elem: 112 cycles(tsc) 37.645 ns (step:250) - (measurement period time:0.376455749 sec time_interval:376455749) - (invoke count:10000000 tsc_interval:1127192310)
[  500.897248] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.999 ns (step:250) - (measurement period time:0.149997635 sec time_interval:149997635) - (invoke count:10000000 tsc_interval:449125920)

Orig:
[   81.987064] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.813 ns (step:0) - (measurement period time:0.081397445 sec time_interval:81397445) - (invoke count:100000000 tsc_interval:243727920)
[   82.595831] time_bench: Type:kmem fastpath reuse Per elem: 178 cycles(tsc) 59.675 ns (step:0) - (measurement period time:0.596752095 sec time_interval:596752095) - (invoke count:10000000 tsc_interval:1786857030)
[   83.031850] time_bench: Type:kmem bulk_fallback Per elem: 127 cycles(tsc) 42.541 ns (step:1) - (measurement period time:0.425415790 sec time_interval:425415790) - (invoke count:10000000 tsc_interval:1273823670)
[   83.340838] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.200 ns (step:1) - (measurement period time:0.292006301 sec time_interval:292006301) - (invoke count:10000000 tsc_interval:874355190)
[   83.630781] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.923 ns (step:2) - (measurement period time:0.279231691 sec time_interval:279231691) - (invoke count:10000000 tsc_interval:836104170)
[   83.821746] time_bench: Type:kmem bulk_quick_reuse Per elem: 52 cycles(tsc) 17.611 ns (step:2) - (measurement period time:0.176116770 sec time_interval:176116770) - (invoke count:10000000 tsc_interval:527346570)
[   84.105841] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.184 ns (step:3) - (measurement period time:0.271845630 sec time_interval:271845630) - (invoke count:9999999 tsc_interval:813988260)
[   84.257733] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.120 ns (step:3) - (measurement period time:0.141208965 sec time_interval:141208965) - (invoke count:9999999 tsc_interval:422821890)
[   84.578730] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.798 ns (step:4) - (measurement period time:0.307982589 sec time_interval:307982589) - (invoke count:10000000 tsc_interval:922193070)
[   84.894740] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.523 ns (step:4) - (measurement period time:0.305231656 sec time_interval:305231656) - (invoke count:10000000 tsc_interval:913955310)
[   85.596699] time_bench: Type:kmem bulk_fallback Per elem: 206 cycles(tsc) 68.977 ns (step:8) - (measurement period time:0.689779758 sec time_interval:689779758) - (invoke count:10000000 tsc_interval:2065410030)
[   85.728679] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.641 ns (step:8) - (measurement period time:0.106415387 sec time_interval:106415387) - (invoke count:10000000 tsc_interval:318639630)
[   86.016723] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.302 ns (step:16) - (measurement period time:0.273021863 sec time_interval:273021863) - (invoke count:10000000 tsc_interval:817509990)
[   86.137711] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 10.005 ns (step:16) - (measurement period time:0.100053210 sec time_interval:100053210) - (invoke count:10000000 tsc_interval:299589180)
[   86.420698] time_bench: Type:kmem bulk_fallback Per elem: 79 cycles(tsc) 26.598 ns (step:30) - (measurement period time:0.265984644 sec time_interval:265984644) - (invoke count:9999990 tsc_interval:796437960)
[   86.534652] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.742 ns (step:30) - (measurement period time:0.097425391 sec time_interval:97425391) - (invoke count:9999990 tsc_interval:291720810)
[   86.812682] time_bench: Type:kmem bulk_fallback Per elem: 79 cycles(tsc) 26.522 ns (step:32) - (measurement period time:0.265225864 sec time_interval:265225864) - (invoke count:10000000 tsc_interval:794166360)
[   86.923650] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.729 ns (step:32) - (measurement period time:0.097294552 sec time_interval:97294552) - (invoke count:10000000 tsc_interval:291328800)
[   87.255647] time_bench: Type:kmem bulk_fallback Per elem: 95 cycles(tsc) 32.050 ns (step:34) - (measurement period time:0.320499429 sec time_interval:320499429) - (invoke count:9999978 tsc_interval:959672160)
[   87.383687] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.492 ns (step:34) - (measurement period time:0.114921393 sec time_interval:114921393) - (invoke count:9999978 tsc_interval:344109030)
[   87.724663] time_bench: Type:kmem bulk_fallback Per elem: 96 cycles(tsc) 32.346 ns (step:48) - (measurement period time:0.323463245 sec time_interval:323463245) - (invoke count:9999984 tsc_interval:968546670)
[   87.847640] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.661 ns (step:48) - (measurement period time:0.106610938 sec time_interval:106610938) - (invoke count:9999984 tsc_interval:319225170)
[   88.167636] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.678 ns (step:64) - (measurement period time:0.306781428 sec time_interval:306781428) - (invoke count:10000000 tsc_interval:918596670)
[   88.287645] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.677 ns (step:64) - (measurement period time:0.106773747 sec time_interval:106773747) - (invoke count:10000000 tsc_interval:319712640)
[   88.634627] time_bench: Type:kmem bulk_fallback Per elem: 100 cycles(tsc) 33.591 ns (step:128) - (measurement period time:0.335914141 sec time_interval:335914141) - (invoke count:10000000 tsc_interval:1005828930)
[   88.785630] time_bench: Type:kmem bulk_quick_reuse Per elem: 40 cycles(tsc) 13.648 ns (step:128) - (measurement period time:0.136483174 sec time_interval:136483174) - (invoke count:10000000 tsc_interval:408671550)
[   89.138604] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 33.981 ns (step:158) - (measurement period time:0.339814415 sec time_interval:339814415) - (invoke count:9999978 tsc_interval:1017507030)
[   89.289633] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.110 ns (step:158) - (measurement period time:0.141101621 sec time_interval:141101621) - (invoke count:9999978 tsc_interval:422500530)
[   89.650638] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 34.887 ns (step:250) - (measurement period time:0.348876887 sec time_interval:348876887) - (invoke count:10000000 tsc_interval:1044643320)
[   89.813613] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.821 ns (step:250) - (measurement period time:0.148213151 sec time_interval:148213151) - (invoke count:10000000 tsc_interval:443794860)

[  120.495694] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.777 ns (step:0) - (measurement period time:0.077764814 sec time_interval:77764814) - (invoke count:100000000 tsc_interval:232850730)
[  121.018849] time_bench: Type:kmem fastpath reuse Per elem: 153 cycles(tsc) 51.274 ns (step:0) - (measurement period time:0.512740018 sec time_interval:512740018) - (invoke count:10000000 tsc_interval:1535297070)
[  121.326965] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.560 ns (step:1) - (measurement period time:0.305608844 sec time_interval:305608844) - (invoke count:10000000 tsc_interval:915084480)
[  121.628922] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.218 ns (step:1) - (measurement period time:0.292184439 sec time_interval:292184439) - (invoke count:10000000 tsc_interval:874887840)
[  122.337817] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.361 ns (step:2) - (measurement period time:0.693612284 sec time_interval:693612284) - (invoke count:10000000 tsc_interval:2076883890)
[  122.520912] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.741 ns (step:2) - (measurement period time:0.177417675 sec time_interval:177417675) - (invoke count:10000000 tsc_interval:531240870)
[  122.872912] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.212 ns (step:3) - (measurement period time:0.342120142 sec time_interval:342120142) - (invoke count:9999999 tsc_interval:1024409910)
[  123.019909] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.084 ns (step:3) - (measurement period time:0.140842225 sec time_interval:140842225) - (invoke count:9999999 tsc_interval:421723650)
[  123.837965] time_bench: Type:kmem bulk_fallback Per elem: 241 cycles(tsc) 80.516 ns (step:4) - (measurement period time:0.805161046 sec time_interval:805161046) - (invoke count:10000000 tsc_interval:2410894650)
[  123.973915] time_bench: Type:kmem bulk_quick_reuse Per elem: 37 cycles(tsc) 12.377 ns (step:4) - (measurement period time:0.123773940 sec time_interval:123773940) - (invoke count:10000000 tsc_interval:370615290)
[  124.273862] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.860 ns (step:8) - (measurement period time:0.288604912 sec time_interval:288604912) - (invoke count:10000000 tsc_interval:864169920)
[  124.546757] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.420 ns (step:8) - (measurement period time:0.264207028 sec time_interval:264207028) - (invoke count:10000000 tsc_interval:791114430)
[  125.191730] time_bench: Type:kmem bulk_fallback Per elem: 190 cycles(tsc) 63.456 ns (step:16) - (measurement period time:0.634568513 sec time_interval:634568513) - (invoke count:10000000 tsc_interval:1900088820)
[  125.296839] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.043 ns (step:16) - (measurement period time:0.100439926 sec time_interval:100439926) - (invoke count:10000000 tsc_interval:300746670)
[  125.580743] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.347 ns (step:30) - (measurement period time:0.273471271 sec time_interval:273471271) - (invoke count:9999990 tsc_interval:818855040)
[  125.836734] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.372 ns (step:30) - (measurement period time:0.243727806 sec time_interval:243727806) - (invoke count:9999990 tsc_interval:729793590)
[  126.508883] time_bench: Type:kmem bulk_fallback Per elem: 197 cycles(tsc) 65.900 ns (step:32) - (measurement period time:0.659009779 sec time_interval:659009779) - (invoke count:10000000 tsc_interval:1973273460)
[  126.612891] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.749 ns (step:32) - (measurement period time:0.097491968 sec time_interval:97491968) - (invoke count:10000000 tsc_interval:291919890)
[  126.968798] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.676 ns (step:34) - (measurement period time:0.346762028 sec time_interval:346762028) - (invoke count:9999978 tsc_interval:1038309510)
[  127.095700] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.648 ns (step:34) - (measurement period time:0.116483925 sec time_interval:116483925) - (invoke count:9999978 tsc_interval:348787590)
[  127.974794] time_bench: Type:kmem bulk_fallback Per elem: 259 cycles(tsc) 86.651 ns (step:48) - (measurement period time:0.866514663 sec time_interval:866514663) - (invoke count:9999984 tsc_interval:2594605770)
[  128.093772] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.426 ns (step:48) - (measurement period time:0.114267827 sec time_interval:114267827) - (invoke count:9999984 tsc_interval:342151620)
[  128.430665] time_bench: Type:kmem bulk_fallback Per elem: 97 cycles(tsc) 32.514 ns (step:64) - (measurement period time:0.325148101 sec time_interval:325148101) - (invoke count:10000000 tsc_interval:973590990)
[  128.546857] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.991 ns (step:64) - (measurement period time:0.109916673 sec time_interval:109916673) - (invoke count:10000000 tsc_interval:329123280)
[  129.431645] time_bench: Type:kmem bulk_fallback Per elem: 261 cycles(tsc) 87.191 ns (step:128) - (measurement period time:0.871911323 sec time_interval:871911323) - (invoke count:10000000 tsc_interval:2610764490)
[  129.583764] time_bench: Type:kmem bulk_quick_reuse Per elem: 43 cycles(tsc) 14.514 ns (step:128) - (measurement period time:0.145148532 sec time_interval:145148532) - (invoke count:10000000 tsc_interval:434617800)
[  130.443627] time_bench: Type:kmem bulk_fallback Per elem: 254 cycles(tsc) 84.982 ns (step:158) - (measurement period time:0.849826310 sec time_interval:849826310) - (invoke count:9999978 tsc_interval:2544635760)
[  130.583738] time_bench: Type:kmem bulk_quick_reuse Per elem: 40 cycles(tsc) 13.399 ns (step:158) - (measurement period time:0.133992977 sec time_interval:133992977) - (invoke count:9999978 tsc_interval:401214210)
[  130.947634] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.206 ns (step:250) - (measurement period time:0.352068766 sec time_interval:352068766) - (invoke count:10000000 tsc_interval:1054199400)
[  131.268601] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.142 ns (step:250) - (measurement period time:0.311429067 sec time_interval:311429067) - (invoke count:10000000 tsc_interval:932511270)

[  135.584335] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077217374 sec time_interval:77217374) - (invoke count:100000000 tsc_interval:231211500)
[  136.122480] time_bench: Type:kmem fastpath reuse Per elem: 156 cycles(tsc) 52.212 ns (step:0) - (measurement period time:0.522120964 sec time_interval:522120964) - (invoke count:10000000 tsc_interval:1563386670)
[  136.762465] time_bench: Type:kmem bulk_fallback Per elem: 186 cycles(tsc) 62.301 ns (step:1) - (measurement period time:0.623010984 sec time_interval:623010984) - (invoke count:10000000 tsc_interval:1865481540)
[  137.248444] time_bench: Type:kmem bulk_quick_reuse Per elem: 142 cycles(tsc) 47.606 ns (step:1) - (measurement period time:0.476063536 sec time_interval:476063536) - (invoke count:10000000 tsc_interval:1425477150)
[  137.540440] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.282 ns (step:2) - (measurement period time:0.282824344 sec time_interval:282824344) - (invoke count:10000000 tsc_interval:846861210)
[  137.724456] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.830 ns (step:2) - (measurement period time:0.178304559 sec time_interval:178304559) - (invoke count:10000000 tsc_interval:533896980)
[  138.366442] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.289 ns (step:3) - (measurement period time:0.632890657 sec time_interval:632890657) - (invoke count:9999999 tsc_interval:1895064930)
[  138.682405] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.603 ns (step:3) - (measurement period time:0.306034382 sec time_interval:306034382) - (invoke count:9999999 tsc_interval:916357950)
[  138.997539] time_bench: Type:kmem bulk_fallback Per elem: 90 cycles(tsc) 30.372 ns (step:4) - (measurement period time:0.303723704 sec time_interval:303723704) - (invoke count:10000000 tsc_interval:909440220)
[  139.131400] time_bench: Type:kmem bulk_quick_reuse Per elem: 37 cycles(tsc) 12.405 ns (step:4) - (measurement period time:0.124058230 sec time_interval:124058230) - (invoke count:10000000 tsc_interval:371467110)
[  139.430407] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.867 ns (step:8) - (measurement period time:0.288673242 sec time_interval:288673242) - (invoke count:10000000 tsc_interval:864374550)
[  139.694401] time_bench: Type:kmem bulk_quick_reuse Per elem: 76 cycles(tsc) 25.593 ns (step:8) - (measurement period time:0.255935939 sec time_interval:255935939) - (invoke count:10000000 tsc_interval:766348440)
[  140.387369] time_bench: Type:kmem bulk_fallback Per elem: 203 cycles(tsc) 68.061 ns (step:16) - (measurement period time:0.680610963 sec time_interval:680610963) - (invoke count:10000000 tsc_interval:2037954090)
[  140.495385] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.173 ns (step:16) - (measurement period time:0.101737300 sec time_interval:101737300) - (invoke count:10000000 tsc_interval:304631430)
[  141.101479] time_bench: Type:kmem bulk_fallback Per elem: 177 cycles(tsc) 59.116 ns (step:30) - (measurement period time:0.591165326 sec time_interval:591165326) - (invoke count:9999990 tsc_interval:1770126360)
[  141.350337] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.305 ns (step:30) - (measurement period time:0.243051460 sec time_interval:243051460) - (invoke count:9999990 tsc_interval:727767660)
[  141.781369] time_bench: Type:kmem bulk_fallback Per elem: 126 cycles(tsc) 42.191 ns (step:32) - (measurement period time:0.421915112 sec time_interval:421915112) - (invoke count:10000000 tsc_interval:1263340320)
[  142.029348] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.208 ns (step:32) - (measurement period time:0.242082250 sec time_interval:242082250) - (invoke count:10000000 tsc_interval:724865610)
[  142.833301] time_bench: Type:kmem bulk_fallback Per elem: 237 cycles(tsc) 79.313 ns (step:34) - (measurement period time:0.793128746 sec time_interval:793128746) - (invoke count:9999978 tsc_interval:2374865760)
[  142.957327] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.796 ns (step:34) - (measurement period time:0.117960158 sec time_interval:117960158) - (invoke count:9999978 tsc_interval:353207850)
[  143.714486] time_bench: Type:kmem bulk_fallback Per elem: 223 cycles(tsc) 74.629 ns (step:48) - (measurement period time:0.746296426 sec time_interval:746296426) - (invoke count:9999984 tsc_interval:2234635890)
[  143.998413] time_bench: Type:kmem bulk_quick_reuse Per elem: 82 cycles(tsc) 27.476 ns (step:48) - (measurement period time:0.274759868 sec time_interval:274759868) - (invoke count:9999984 tsc_interval:822712920)
[  144.717341] time_bench: Type:kmem bulk_fallback Per elem: 211 cycles(tsc) 70.598 ns (step:64) - (measurement period time:0.705984861 sec time_interval:705984861) - (invoke count:10000000 tsc_interval:2113930770)
[  144.838259] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.788 ns (step:64) - (measurement period time:0.107887319 sec time_interval:107887319) - (invoke count:10000000 tsc_interval:323046420)
[  145.190386] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.174 ns (step:128) - (measurement period time:0.341741874 sec time_interval:341741874) - (invoke count:10000000 tsc_interval:1023278130)
[  145.514275] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.128 ns (step:128) - (measurement period time:0.311288149 sec time_interval:311288149) - (invoke count:10000000 tsc_interval:932088960)
[  146.367413] time_bench: Type:kmem bulk_fallback Per elem: 251 cycles(tsc) 84.015 ns (step:158) - (measurement period time:0.840153692 sec time_interval:840153692) - (invoke count:9999978 tsc_interval:2515672920)
[  146.523219] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.280 ns (step:158) - (measurement period time:0.142806094 sec time_interval:142806094) - (invoke count:9999978 tsc_interval:427603830)
[  146.888375] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.119 ns (step:250) - (measurement period time:0.351191259 sec time_interval:351191259) - (invoke count:10000000 tsc_interval:1051571610)
[  147.291226] time_bench: Type:kmem bulk_quick_reuse Per elem: 117 cycles(tsc) 39.200 ns (step:250) - (measurement period time:0.392003176 sec time_interval:392003176) - (invoke count:10000000 tsc_interval:1173774360)
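
For anyone who wants to compare these runs without eyeballing them, below is a minimal sketch of a parser for the time_bench lines above. It is not part of the benchmark module; the names parse_time_bench() and per_elem() are made up for illustration. It assumes the "Per elem" figures are simply tsc_interval and time_interval divided by the invoke count and truncated, which matches the numbers printed in these logs.

```python
import re

# Matches one time_bench line as printed in the logs above.
LINE_RE = re.compile(
    r"time_bench: Type:(?P<type>.+?) Per elem: (?P<cycles>\d+) cycles\(tsc\) "
    r"(?P<ns>[\d.]+) ns \(step:(?P<step>\d+)\).*?"
    r"time_interval:(?P<time_interval>\d+)\).*?"
    r"invoke count:(?P<count>\d+) tsc_interval:(?P<tsc_interval>\d+)\)"
)


def parse_time_bench(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "step": int(m.group("step")),
        "cycles": int(m.group("cycles")),
        "ns": m.group("ns"),
        "time_interval": int(m.group("time_interval")),
        "count": int(m.group("count")),
        "tsc_interval": int(m.group("tsc_interval")),
    }


def per_elem(rec):
    """Recompute per-element cost from the raw intervals (assumed: truncating division)."""
    cycles = rec["tsc_interval"] // rec["count"]
    # Work in thousandths of a nanosecond to avoid floating-point rounding.
    ns_milli = rec["time_interval"] * 1000 // rec["count"]
    ns = "%d.%03d" % (ns_milli // 1000, ns_milli % 1000)
    return cycles, ns


sample = (
    "[  453.344946] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 "
    "cycles(tsc) 10.750 ns (step:64) - (measurement period time:0.107500964 "
    "sec time_interval:107500964) - (invoke count:10000000 "
    "tsc_interval:321880290)"
)
rec = parse_time_bench(sample)
print(rec["type"], rec["step"], per_elem(rec))
```

Feeding each run through parse_time_bench() and grouping by (type, step) gives a quick way to line up the patched and Orig numbers side by side.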


SLAB:

Orig:
[   80.499545] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.830 ns (step:0) - (measurement period time:0.083085912 sec time_interval:83085912) - (invoke count:100000000 tsc_interval:248781840)
[   81.099911] time_bench: Type:kmem fastpath reuse Per elem: 174 cycles(tsc) 58.430 ns (step:0) - (measurement period time:0.584308185 sec time_interval:584308185) - (invoke count:10000000 tsc_interval:1749584790)
[   81.421881] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 30.019 ns (step:1) - (measurement period time:0.300198661 sec time_interval:300198661) - (invoke count:10000000 tsc_interval:898879710)
[   81.910960] time_bench: Type:kmem bulk_quick_reuse Per elem: 143 cycles(tsc) 47.889 ns (step:1) - (measurement period time:0.478893310 sec time_interval:478893310) - (invoke count:10000000 tsc_interval:1433941530)
[   82.583917] time_bench: Type:kmem bulk_fallback Per elem: 197 cycles(tsc) 65.813 ns (step:2) - (measurement period time:0.658134429 sec time_interval:658134429) - (invoke count:10000000 tsc_interval:1970640660)
[   82.751867] time_bench: Type:kmem bulk_quick_reuse Per elem: 45 cycles(tsc) 15.221 ns (step:2) - (measurement period time:0.152212195 sec time_interval:152212195) - (invoke count:10000000 tsc_interval:455766000)
[   83.047850] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.831 ns (step:3) - (measurement period time:0.278309326 sec time_interval:278309326) - (invoke count:9999999 tsc_interval:833336640)
[   83.186831] time_bench: Type:kmem bulk_quick_reuse Per elem: 38 cycles(tsc) 12.885 ns (step:3) - (measurement period time:0.128853000 sec time_interval:128853000) - (invoke count:9999999 tsc_interval:385821900)
[   83.514848] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.901 ns (step:4) - (measurement period time:0.309012550 sec time_interval:309012550) - (invoke count:10000000 tsc_interval:925270980)
[   83.646835] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.711 ns (step:4) - (measurement period time:0.117116655 sec time_interval:117116655) - (invoke count:10000000 tsc_interval:350679900)
[   83.954817] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 29.739 ns (step:8) - (measurement period time:0.297398266 sec time_interval:297398266) - (invoke count:10000000 tsc_interval:890494290)
[   84.069826] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.943 ns (step:8) - (measurement period time:0.099437599 sec time_interval:99437599) - (invoke count:10000000 tsc_interval:297743760)
[   84.361844] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.263 ns (step:16) - (measurement period time:0.282630878 sec time_interval:282630878) - (invoke count:10000000 tsc_interval:846277020)
[   84.471816] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.643 ns (step:16) - (measurement period time:0.096439729 sec time_interval:96439729) - (invoke count:10000000 tsc_interval:288767550)
[   84.977793] time_bench: Type:kmem bulk_fallback Per elem: 145 cycles(tsc) 48.452 ns (step:30) - (measurement period time:0.484520609 sec time_interval:484520609) - (invoke count:9999990 tsc_interval:1450791510)
[   85.222771] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.726 ns (step:30) - (measurement period time:0.227266268 sec time_interval:227266268) - (invoke count:9999990 tsc_interval:680498580)
[   85.814766] time_bench: Type:kmem bulk_fallback Per elem: 173 cycles(tsc) 57.907 ns (step:32) - (measurement period time:0.579072933 sec time_interval:579072933) - (invoke count:10000000 tsc_interval:1733908170)
[   85.914739] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.385 ns (step:32) - (measurement period time:0.093857661 sec time_interval:93857661) - (invoke count:10000000 tsc_interval:281035770)
[   86.207764] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.489 ns (step:34) - (measurement period time:0.274891966 sec time_interval:274891966) - (invoke count:9999978 tsc_interval:823104480)
[   86.452755] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 23.040 ns (step:34) - (measurement period time:0.230401610 sec time_interval:230401610) - (invoke count:9999978 tsc_interval:689886630)
[   86.736743] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.326 ns (step:48) - (measurement period time:0.273267062 sec time_interval:273267062) - (invoke count:9999984 tsc_interval:818238330)
[   86.839857] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.506 ns (step:48) - (measurement period time:0.095059470 sec time_interval:95059470) - (invoke count:9999984 tsc_interval:284634690)
[   87.432947] time_bench: Type:kmem bulk_fallback Per elem: 172 cycles(tsc) 57.565 ns (step:64) - (measurement period time:0.575650143 sec time_interval:575650143) - (invoke count:10000000 tsc_interval:1723659720)
[   87.536682] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.267 ns (step:64) - (measurement period time:0.092674016 sec time_interval:92674016) - (invoke count:10000000 tsc_interval:277491600)
[   87.829693] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.082 ns (step:128) - (measurement period time:0.280825239 sec time_interval:280825239) - (invoke count:10000000 tsc_interval:840869820)
[   87.942860] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.387 ns (step:128) - (measurement period time:0.103871104 sec time_interval:103871104) - (invoke count:10000000 tsc_interval:311019150)
[   88.242686] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.101 ns (step:158) - (measurement period time:0.281012946 sec time_interval:281012946) - (invoke count:9999978 tsc_interval:841431990)
[   88.354683] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.852 ns (step:158) - (measurement period time:0.098524040 sec time_interval:98524040) - (invoke count:9999978 tsc_interval:295008030)
[   88.655671] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.946 ns (step:250) - (measurement period time:0.289463793 sec time_interval:289463793) - (invoke count:10000000 tsc_interval:866736720)
[   88.776655] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.695 ns (step:250) - (measurement period time:0.106953355 sec time_interval:106953355) - (invoke count:10000000 tsc_interval:320247930)

[  100.068788] time_bench: Type:for_loop Per elem: 4 cycles(tsc) 1.567 ns (step:0) - (measurement period time:0.156710185 sec time_interval:156710185) - (invoke count:100000000 tsc_interval:469233480)
[  100.654304] time_bench: Type:kmem fastpath reuse Per elem: 170 cycles(tsc) 56.967 ns (step:0) - (measurement period time:0.569671924 sec time_interval:569671924) - (invoke count:10000000 tsc_interval:1705759620)
[  101.373300] time_bench: Type:kmem bulk_fallback Per elem: 212 cycles(tsc) 70.812 ns (step:1) - (measurement period time:0.708129741 sec time_interval:708129741) - (invoke count:10000000 tsc_interval:2120342250)
[  101.840283] time_bench: Type:kmem bulk_quick_reuse Per elem: 136 cycles(tsc) 45.527 ns (step:1) - (measurement period time:0.455275848 sec time_interval:455275848) - (invoke count:10000000 tsc_interval:1363225020)
[  102.139276] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 29.044 ns (step:2) - (measurement period time:0.290446762 sec time_interval:290446762) - (invoke count:10000000 tsc_interval:869680110)
[  102.303272] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.383 ns (step:2) - (measurement period time:0.153838537 sec time_interval:153838537) - (invoke count:10000000 tsc_interval:460636140)
[  103.012346] time_bench: Type:kmem bulk_fallback Per elem: 209 cycles(tsc) 69.979 ns (step:3) - (measurement period time:0.699793666 sec time_interval:699793666) - (invoke count:9999999 tsc_interval:2095381860)
[  103.148352] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.208 ns (step:3) - (measurement period time:0.132082868 sec time_interval:132082868) - (invoke count:9999999 tsc_interval:395493210)
[  103.462233] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.467 ns (step:4) - (measurement period time:0.304675759 sec time_interval:304675759) - (invoke count:10000000 tsc_interval:912285930)
[  103.761428] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.059 ns (step:4) - (measurement period time:0.290597158 sec time_interval:290597158) - (invoke count:10000000 tsc_interval:870129780)
[  104.501334] time_bench: Type:kmem bulk_fallback Per elem: 218 cycles(tsc) 73.076 ns (step:8) - (measurement period time:0.730767822 sec time_interval:730767822) - (invoke count:10000000 tsc_interval:2188127310)
[  104.732329] time_bench: Type:kmem bulk_quick_reuse Per elem: 66 cycles(tsc) 22.280 ns (step:8) - (measurement period time:0.222806934 sec time_interval:222806934) - (invoke count:10000000 tsc_interval:667146780)
[  105.346195] time_bench: Type:kmem bulk_fallback Per elem: 180 cycles(tsc) 60.308 ns (step:16) - (measurement period time:0.603085855 sec time_interval:603085855) - (invoke count:10000000 tsc_interval:1805810910)
[  105.565213] time_bench: Type:kmem bulk_quick_reuse Per elem: 62 cycles(tsc) 20.731 ns (step:16) - (measurement period time:0.207317878 sec time_interval:207317878) - (invoke count:10000000 tsc_interval:620768190)
[  106.154163] time_bench: Type:kmem bulk_fallback Per elem: 173 cycles(tsc) 57.884 ns (step:30) - (measurement period time:0.578841035 sec time_interval:578841035) - (invoke count:9999990 tsc_interval:1733213910)
[  106.450218] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.455 ns (step:30) - (measurement period time:0.284558769 sec time_interval:284558769) - (invoke count:9999990 tsc_interval:852048780)
[  107.137140] time_bench: Type:kmem bulk_fallback Per elem: 199 cycles(tsc) 66.729 ns (step:32) - (measurement period time:0.667298185 sec time_interval:667298185) - (invoke count:10000000 tsc_interval:1998081120)
[  107.244232] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.655 ns (step:32) - (measurement period time:0.096558958 sec time_interval:96558958) - (invoke count:10000000 tsc_interval:289124430)
[  107.528225] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.584 ns (step:34) - (measurement period time:0.275841028 sec time_interval:275841028) - (invoke count:9999978 tsc_interval:825940800)
[  107.628207] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.182 ns (step:34) - (measurement period time:0.091822659 sec time_interval:91822659) - (invoke count:9999978 tsc_interval:274942830)
[  107.913114] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.639 ns (step:48) - (measurement period time:0.276397658 sec time_interval:276397658) - (invoke count:9999984 tsc_interval:827612400)
[  108.013118] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.281 ns (step:48) - (measurement period time:0.092811773 sec time_interval:92811773) - (invoke count:9999984 tsc_interval:277904550)
[  108.293222] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.413 ns (step:64) - (measurement period time:0.274134107 sec time_interval:274134107) - (invoke count:10000000 tsc_interval:820835190)
[  108.394122] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.252 ns (step:64) - (measurement period time:0.092524305 sec time_interval:92524305) - (invoke count:10000000 tsc_interval:277043580)
[  109.015115] time_bench: Type:kmem bulk_fallback Per elem: 183 cycles(tsc) 61.171 ns (step:128) - (measurement period time:0.611713784 sec time_interval:611713784) - (invoke count:10000000 tsc_interval:1831645590)
[  109.282175] time_bench: Type:kmem bulk_quick_reuse Per elem: 76 cycles(tsc) 25.538 ns (step:128) - (measurement period time:0.255382498 sec time_interval:255382498) - (invoke count:10000000 tsc_interval:764687130)
[  109.898178] time_bench: Type:kmem bulk_fallback Per elem: 181 cycles(tsc) 60.732 ns (step:158) - (measurement period time:0.607324486 sec time_interval:607324486) - (invoke count:9999978 tsc_interval:1818501990)
[  110.111052] time_bench: Type:kmem bulk_quick_reuse Per elem: 60 cycles(tsc) 20.241 ns (step:158) - (measurement period time:0.202414120 sec time_interval:202414120) - (invoke count:9999978 tsc_interval:606085230)
[  110.715034] time_bench: Type:kmem bulk_fallback Per elem: 178 cycles(tsc) 59.483 ns (step:250) - (measurement period time:0.594833299 sec time_interval:594833299) - (invoke count:10000000 tsc_interval:1781100600)
[  110.974129] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.167 ns (step:250) - (measurement period time:0.251679547 sec time_interval:251679547) - (invoke count:10000000 tsc_interval:753599310)

[  111.856730] time_bench: Type:for_loop Per elem: 4 cycles(tsc) 1.349 ns (step:0) - (measurement period time:0.134993630 sec time_interval:134993630) - (invoke count:100000000 tsc_interval:404208090)
[  112.407098] time_bench: Type:kmem fastpath reuse Per elem: 159 cycles(tsc) 53.400 ns (step:0) - (measurement period time:0.534001917 sec time_interval:534001917) - (invoke count:10000000 tsc_interval:1598953680)
[  113.150981] time_bench: Type:kmem bulk_fallback Per elem: 216 cycles(tsc) 72.396 ns (step:1) - (measurement period time:0.723960939 sec time_interval:723960939) - (invoke count:10000000 tsc_interval:2167744650)
[  113.381971] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.501 ns (step:1) - (measurement period time:0.225017504 sec time_interval:225017504) - (invoke count:10000000 tsc_interval:673765620)
[  113.681963] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.967 ns (step:2) - (measurement period time:0.289671345 sec time_interval:289671345) - (invoke count:10000000 tsc_interval:867358230)
[  113.843955] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.643 ns (step:2) - (measurement period time:0.156437917 sec time_interval:156437917) - (invoke count:10000000 tsc_interval:468418740)
[  114.140953] time_bench: Type:kmem bulk_fallback Per elem: 85 cycles(tsc) 28.414 ns (step:3) - (measurement period time:0.284148848 sec time_interval:284148848) - (invoke count:9999999 tsc_interval:850821930)
[  114.279933] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.207 ns (step:3) - (measurement period time:0.132073229 sec time_interval:132073229) - (invoke count:9999999 tsc_interval:395463870)
[  114.609120] time_bench: Type:kmem bulk_fallback Per elem: 93 cycles(tsc) 31.197 ns (step:4) - (measurement period time:0.311972955 sec time_interval:311972955) - (invoke count:10000000 tsc_interval:934136040)
[  114.909950] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.326 ns (step:4) - (measurement period time:0.293267093 sec time_interval:293267093) - (invoke count:10000000 tsc_interval:878124330)
[  115.622058] time_bench: Type:kmem bulk_fallback Per elem: 209 cycles(tsc) 70.083 ns (step:8) - (measurement period time:0.700833456 sec time_interval:700833456) - (invoke count:10000000 tsc_interval:2098495740)
[  115.729918] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.072 ns (step:8) - (measurement period time:0.100729060 sec time_interval:100729060) - (invoke count:10000000 tsc_interval:301610850)
[  116.445890] time_bench: Type:kmem bulk_fallback Per elem: 211 cycles(tsc) 70.512 ns (step:16) - (measurement period time:0.705126903 sec time_interval:705126903) - (invoke count:10000000 tsc_interval:2111350800)
[  116.597986] time_bench: Type:kmem bulk_quick_reuse Per elem: 43 cycles(tsc) 14.451 ns (step:16) - (measurement period time:0.144517256 sec time_interval:144517256) - (invoke count:10000000 tsc_interval:432725340)
[  117.293842] time_bench: Type:kmem bulk_fallback Per elem: 205 cycles(tsc) 68.660 ns (step:30) - (measurement period time:0.686602607 sec time_interval:686602607) - (invoke count:9999990 tsc_interval:2055883860)
[  117.513834] time_bench: Type:kmem bulk_quick_reuse Per elem: 65 cycles(tsc) 21.724 ns (step:30) - (measurement period time:0.217241306 sec time_interval:217241306) - (invoke count:9999990 tsc_interval:650481120)
[  118.157816] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.344 ns (step:32) - (measurement period time:0.633443044 sec time_interval:633443044) - (invoke count:10000000 tsc_interval:1896708780)
[  118.380992] time_bench: Type:kmem bulk_quick_reuse Per elem: 64 cycles(tsc) 21.381 ns (step:32) - (measurement period time:0.213815392 sec time_interval:213815392) - (invoke count:10000000 tsc_interval:640223670)
[  118.981808] time_bench: Type:kmem bulk_fallback Per elem: 176 cycles(tsc) 58.885 ns (step:34) - (measurement period time:0.588855917 sec time_interval:588855917) - (invoke count:9999978 tsc_interval:1763201640)
[  119.078787] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.191 ns (step:34) - (measurement period time:0.091919103 sec time_interval:91919103) - (invoke count:9999978 tsc_interval:275231340)
[  119.368789] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.533 ns (step:48) - (measurement period time:0.275334132 sec time_interval:275334132) - (invoke count:9999984 tsc_interval:824428110)
[  119.471780] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.519 ns (step:48) - (measurement period time:0.095195091 sec time_interval:95195091) - (invoke count:9999984 tsc_interval:285040080)
[  119.775775] time_bench: Type:kmem bulk_fallback Per elem: 87 cycles(tsc) 29.149 ns (step:64) - (measurement period time:0.291498274 sec time_interval:291498274) - (invoke count:10000000 tsc_interval:872828640)
[  119.896771] time_bench: Type:kmem bulk_quick_reuse Per elem: 33 cycles(tsc) 11.330 ns (step:64) - (measurement period time:0.113304207 sec time_interval:113304207) - (invoke count:10000000 tsc_interval:339264000)
[  120.199773] time_bench: Type:kmem bulk_fallback Per elem: 87 cycles(tsc) 29.289 ns (step:128) - (measurement period time:0.292891157 sec time_interval:292891157) - (invoke count:10000000 tsc_interval:876999360)
[  120.320757] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.476 ns (step:128) - (measurement period time:0.114763286 sec time_interval:114763286) - (invoke count:10000000 tsc_interval:343632900)
[  120.976762] time_bench: Type:kmem bulk_fallback Per elem: 192 cycles(tsc) 64.320 ns (step:158) - (measurement period time:0.643207519 sec time_interval:643207519) - (invoke count:9999978 tsc_interval:1925946840)
[  121.231790] time_bench: Type:kmem bulk_quick_reuse Per elem: 73 cycles(tsc) 24.705 ns (step:158) - (measurement period time:0.247055281 sec time_interval:247055281) - (invoke count:9999978 tsc_interval:739752480)
[  121.875817] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.224 ns (step:250) - (measurement period time:0.632244442 sec time_interval:632244442) - (invoke count:10000000 tsc_interval:1893119520)
[  122.148737] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.410 ns (step:250) - (measurement period time:0.264101742 sec time_interval:264101742) - (invoke count:10000000 tsc_interval:790794030)
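
The raw lines are hard to compare by eye, since individual runs vary a lot at the same step. Below is a minimal sketch of how the per-element cycle counts could be averaged per (type, step) across runs. It is not part of the benchmark module: the regular expression is only inferred from the log format shown here, and the names parse_time_bench() and mean_cycles() are made up for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# Pattern inferred from the log lines in this mail (an assumption), e.g.:
#   time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 30.019 ns (step:1)
LINE_RE = re.compile(
    r"time_bench: Type:(?P<type>.+?) Per elem: (?P<cycles>\d+) cycles\(tsc\) "
    r"(?P<ns>[\d.]+) ns \(step:(?P<step>\d+)\)"
)


def parse_time_bench(lines):
    """Yield (type, step, cycles, ns) for every time_bench line found."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield m["type"], int(m["step"]), int(m["cycles"]), float(m["ns"])


def mean_cycles(lines):
    """Average per-element cycles over all runs, keyed by (type, step)."""
    samples = defaultdict(list)
    for typ, step, cycles, _ns in parse_time_bench(lines):
        samples[(typ, step)].append(cycles)
    return {key: mean(vals) for key, vals in samples.items()}
```

Feeding the "Orig" and "Patched" blocks to mean_cycles() separately gives two dictionaries that can be compared key by key. Lines that do not match the pattern are skipped.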

Patched:

[  654.054203] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.776 ns (step:0) - (measurement period time:0.077664118 sec time_interval:77664118) - (invoke count:100000000 tsc_interval:232545750)
[  654.592857] time_bench: Type:kmem fastpath reuse Per elem: 161 cycles(tsc) 53.860 ns (step:0) - (measurement period time:0.538607021 sec time_interval:538607021) - (invoke count:10000000 tsc_interval:1612734660)
[  655.248550] time_bench: Type:kmem bulk_fallback Per elem: 196 cycles(tsc) 65.568 ns (step:1) - (measurement period time:0.655680061 sec time_interval:655680061) - (invoke count:10000000 tsc_interval:1963282620)
[  655.475563] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.697 ns (step:1) - (measurement period time:0.226975586 sec time_interval:226975586) - (invoke count:10000000 tsc_interval:679625070)
[  655.757615] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.204 ns (step:2) - (measurement period time:0.282047104 sec time_interval:282047104) - (invoke count:10000000 tsc_interval:844524090)
[  655.943657] time_bench: Type:kmem bulk_quick_reuse Per elem: 55 cycles(tsc) 18.599 ns (step:2) - (measurement period time:0.185992389 sec time_interval:185992389) - (invoke count:10000000 tsc_interval:556910100)
[  656.221528] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.783 ns (step:3) - (measurement period time:0.277833288 sec time_interval:277833288) - (invoke count:9999999 tsc_interval:831906840)
[  656.535062] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.351 ns (step:3) - (measurement period time:0.313512217 sec time_interval:313512217) - (invoke count:9999999 tsc_interval:938739120)
[  656.843267] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.818 ns (step:4) - (measurement period time:0.308185034 sec time_interval:308185034) - (invoke count:10000000 tsc_interval:922788240)
[  656.961808] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.850 ns (step:4) - (measurement period time:0.118503561 sec time_interval:118503561) - (invoke count:10000000 tsc_interval:354830820)
[  657.691366] time_bench: Type:kmem bulk_fallback Per elem: 218 cycles(tsc) 72.954 ns (step:8) - (measurement period time:0.729541418 sec time_interval:729541418) - (invoke count:10000000 tsc_interval:2184443100)
[  657.792001] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.060 ns (step:8) - (measurement period time:0.100604744 sec time_interval:100604744) - (invoke count:10000000 tsc_interval:301236990)
[  658.070712] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.868 ns (step:16) - (measurement period time:0.278687720 sec time_interval:278687720) - (invoke count:10000000 tsc_interval:834465960)
[  658.169621] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.887 ns (step:16) - (measurement period time:0.098871033 sec time_interval:98871033) - (invoke count:10000000 tsc_interval:296045940)
[  658.846891] time_bench: Type:kmem bulk_fallback Per elem: 202 cycles(tsc) 67.726 ns (step:30) - (measurement period time:0.677260248 sec time_interval:677260248) - (invoke count:9999990 tsc_interval:2027899590)
[  658.940547] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.354 ns (step:30) - (measurement period time:0.093547668 sec time_interval:93547668) - (invoke count:9999990 tsc_interval:280105560)
[  659.214131] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.356 ns (step:32) - (measurement period time:0.273564878 sec time_interval:273564878) - (invoke count:10000000 tsc_interval:819126750)
[  659.307010] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.286 ns (step:32) - (measurement period time:0.092862249 sec time_interval:92862249) - (invoke count:10000000 tsc_interval:278053470)
[  659.577675] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.065 ns (step:34) - (measurement period time:0.270657877 sec time_interval:270657877) - (invoke count:9999978 tsc_interval:810422340)
[  659.670155] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.246 ns (step:34) - (measurement period time:0.092468447 sec time_interval:92468447) - (invoke count:9999978 tsc_interval:276874410)
[  659.941498] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.129 ns (step:48) - (measurement period time:0.271292799 sec time_interval:271292799) - (invoke count:9999984 tsc_interval:812323620)
[  660.034358] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.284 ns (step:48) - (measurement period time:0.092846689 sec time_interval:92846689) - (invoke count:9999984 tsc_interval:278007390)
[  660.305652] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.125 ns (step:64) - (measurement period time:0.271257793 sec time_interval:271257793) - (invoke count:10000000 tsc_interval:812218680)
[  660.535235] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.955 ns (step:64) - (measurement period time:0.229550122 sec time_interval:229550122) - (invoke count:10000000 tsc_interval:687333360)
[  660.814888] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.964 ns (step:128) - (measurement period time:0.279643666 sec time_interval:279643666) - (invoke count:10000000 tsc_interval:837328200)
[  660.915969] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.104 ns (step:128) - (measurement period time:0.101047589 sec time_interval:101047589) - (invoke count:10000000 tsc_interval:302562990)
[  661.275325] time_bench: Type:kmem bulk_fallback Per elem: 107 cycles(tsc) 35.933 ns (step:158) - (measurement period time:0.359338210 sec time_interval:359338210) - (invoke count:9999978 tsc_interval:1075954290)
[  661.375091] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.975 ns (step:158) - (measurement period time:0.099750172 sec time_interval:99750172) - (invoke count:9999978 tsc_interval:298678200)
[  661.655844] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.074 ns (step:250) - (measurement period time:0.280746521 sec time_interval:280746521) - (invoke count:10000000 tsc_interval:840630900)
[  661.762658] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.680 ns (step:250) - (measurement period time:0.106802018 sec time_interval:106802018) - (invoke count:10000000 tsc_interval:319793460)

[  663.188701] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077219119 sec time_interval:77219119) - (invoke count:100000000 tsc_interval:231214350)
[  663.723737] time_bench: Type:kmem fastpath reuse Per elem: 160 cycles(tsc) 53.501 ns (step:0) - (measurement period time:0.535016285 sec time_interval:535016285) - (invoke count:10000000 tsc_interval:1601983200)
[  664.022069] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 29.828 ns (step:1) - (measurement period time:0.298280101 sec time_interval:298280101) - (invoke count:10000000 tsc_interval:893130450)
[  664.248849] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.677 ns (step:1) - (measurement period time:0.226775284 sec time_interval:226775284) - (invoke count:10000000 tsc_interval:679026090)
[  664.530649] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.179 ns (step:2) - (measurement period time:0.281793671 sec time_interval:281793671) - (invoke count:10000000 tsc_interval:843766020)
[  664.686627] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.593 ns (step:2) - (measurement period time:0.155939154 sec time_interval:155939154) - (invoke count:10000000 tsc_interval:466923720)
[  665.370321] time_bench: Type:kmem bulk_fallback Per elem: 204 cycles(tsc) 68.367 ns (step:3) - (measurement period time:0.683678844 sec time_interval:683678844) - (invoke count:9999999 tsc_interval:2047118220)
[  665.685507] time_bench: Type:kmem bulk_quick_reuse Per elem: 94 cycles(tsc) 31.513 ns (step:3) - (measurement period time:0.315139143 sec time_interval:315139143) - (invoke count:9999999 tsc_interval:943611060)
[  666.448847] time_bench: Type:kmem bulk_fallback Per elem: 228 cycles(tsc) 76.331 ns (step:4) - (measurement period time:0.763310680 sec time_interval:763310680) - (invoke count:10000000 tsc_interval:2285557860)
[  666.745314] time_bench: Type:kmem bulk_quick_reuse Per elem: 88 cycles(tsc) 29.643 ns (step:4) - (measurement period time:0.296436791 sec time_interval:296436791) - (invoke count:10000000 tsc_interval:887610960)
[  667.041829] time_bench: Type:kmem bulk_fallback Per elem: 88 cycles(tsc) 29.650 ns (step:8) - (measurement period time:0.296505592 sec time_interval:296505592) - (invoke count:10000000 tsc_interval:887817120)
[  667.142484] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.064 ns (step:8) - (measurement period time:0.100642315 sec time_interval:100642315) - (invoke count:10000000 tsc_interval:301350000)
[  667.420593] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.810 ns (step:16) - (measurement period time:0.278104977 sec time_interval:278104977) - (invoke count:10000000 tsc_interval:832721010)
[  667.519271] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.866 ns (step:16) - (measurement period time:0.098662815 sec time_interval:98662815) - (invoke count:10000000 tsc_interval:295422450)
[  667.792475] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.315 ns (step:30) - (measurement period time:0.273152701 sec time_interval:273152701) - (invoke count:9999990 tsc_interval:817892820)
[  668.023804] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.130 ns (step:30) - (measurement period time:0.231303811 sec time_interval:231303811) - (invoke count:9999990 tsc_interval:692584950)
[  668.696907] time_bench: Type:kmem bulk_fallback Per elem: 201 cycles(tsc) 67.306 ns (step:32) - (measurement period time:0.673067682 sec time_interval:673067682) - (invoke count:10000000 tsc_interval:2015345790)
[  668.889019] time_bench: Type:kmem bulk_quick_reuse Per elem: 57 cycles(tsc) 19.208 ns (step:32) - (measurement period time:0.192088279 sec time_interval:192088279) - (invoke count:10000000 tsc_interval:575162820)
[  669.342870] time_bench: Type:kmem bulk_fallback Per elem: 135 cycles(tsc) 45.383 ns (step:34) - (measurement period time:0.453831353 sec time_interval:453831353) - (invoke count:9999978 tsc_interval:1358892420)
[  669.436107] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.322 ns (step:34) - (measurement period time:0.093220843 sec time_interval:93220843) - (invoke count:9999978 tsc_interval:279126840)
[  669.707772] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.165 ns (step:48) - (measurement period time:0.271654970 sec time_interval:271654970) - (invoke count:9999984 tsc_interval:813407310)
[  669.800509] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.268 ns (step:48) - (measurement period time:0.092683978 sec time_interval:92683978) - (invoke count:9999984 tsc_interval:277520190)
[  670.068757] time_bench: Type:kmem bulk_fallback Per elem: 80 cycles(tsc) 26.823 ns (step:64) - (measurement period time:0.268231313 sec time_interval:268231313) - (invoke count:10000000 tsc_interval:803156580)
[  670.297078] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.829 ns (step:64) - (measurement period time:0.228295958 sec time_interval:228295958) - (invoke count:10000000 tsc_interval:683578080)
[  670.573819] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.673 ns (step:128) - (measurement period time:0.276731254 sec time_interval:276731254) - (invoke count:10000000 tsc_interval:828607050)
[  670.676864] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.300 ns (step:128) - (measurement period time:0.103002111 sec time_interval:103002111) - (invoke count:10000000 tsc_interval:308415540)
[  671.318177] time_bench: Type:kmem bulk_fallback Per elem: 192 cycles(tsc) 64.130 ns (step:158) - (measurement period time:0.641303389 sec time_interval:641303389) - (invoke count:9999978 tsc_interval:1920234600)
[  671.417083] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.889 ns (step:158) - (measurement period time:0.098890269 sec time_interval:98890269) - (invoke count:9999978 tsc_interval:296103210)
[  671.700461] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.334 ns (step:250) - (measurement period time:0.283346426 sec time_interval:283346426) - (invoke count:10000000 tsc_interval:848415660)
[  671.965515] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.502 ns (step:250) - (measurement period time:0.265021064 sec time_interval:265021064) - (invoke count:10000000 tsc_interval:793543500)

[  686.749446] time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.660 ns (step:0) - (measurement period time:0.066028480 sec time_interval:66028480) - (invoke count:100000000 tsc_interval:197707140)
[  687.296902] time_bench: Type:kmem fastpath reuse Per elem: 163 cycles(tsc) 54.742 ns (step:0) - (measurement period time:0.547423736 sec time_interval:547423736) - (invoke count:10000000 tsc_interval:1639141260)
[  687.910620] time_bench: Type:kmem bulk_fallback Per elem: 183 cycles(tsc) 61.369 ns (step:1) - (measurement period time:0.613692564 sec time_interval:613692564) - (invoke count:10000000 tsc_interval:1837568160)
[  688.381090] time_bench: Type:kmem bulk_quick_reuse Per elem: 140 cycles(tsc) 47.045 ns (step:1) - (measurement period time:0.470452576 sec time_interval:470452576) - (invoke count:10000000 tsc_interval:1408667550)
[  688.662045] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.094 ns (step:2) - (measurement period time:0.280943997 sec time_interval:280943997) - (invoke count:10000000 tsc_interval:841225230)
[  688.817464] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.540 ns (step:2) - (measurement period time:0.155409002 sec time_interval:155409002) - (invoke count:10000000 tsc_interval:465337980)
[  689.094749] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.723 ns (step:3) - (measurement period time:0.277235751 sec time_interval:277235751) - (invoke count:9999999 tsc_interval:830122170)
[  689.225706] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.091 ns (step:3) - (measurement period time:0.130919113 sec time_interval:130919113) - (invoke count:9999999 tsc_interval:392008440)
[  689.988861] time_bench: Type:kmem bulk_fallback Per elem: 228 cycles(tsc) 76.314 ns (step:4) - (measurement period time:0.763146670 sec time_interval:763146670) - (invoke count:10000000 tsc_interval:2285076870)
[  690.274210] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.532 ns (step:4) - (measurement period time:0.285320525 sec time_interval:285320525) - (invoke count:10000000 tsc_interval:854329500)
[  690.862234] time_bench: Type:kmem bulk_fallback Per elem: 176 cycles(tsc) 58.799 ns (step:8) - (measurement period time:0.587998540 sec time_interval:587998540) - (invoke count:10000000 tsc_interval:1760633010)
[  690.964020] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.171 ns (step:8) - (measurement period time:0.101718599 sec time_interval:101718599) - (invoke count:10000000 tsc_interval:304573500)
[  691.245251] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.122 ns (step:16) - (measurement period time:0.281223369 sec time_interval:281223369) - (invoke count:10000000 tsc_interval:842060850)
[  691.341256] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.599 ns (step:16) - (measurement period time:0.095990014 sec time_interval:95990014) - (invoke count:10000000 tsc_interval:287420100)
[  691.616379] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.511 ns (step:30) - (measurement period time:0.275116534 sec time_interval:275116534) - (invoke count:9999990 tsc_interval:823776390)
[  691.710275] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.388 ns (step:30) - (measurement period time:0.093884613 sec time_interval:93884613) - (invoke count:9999990 tsc_interval:281115990)
[  691.982082] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.180 ns (step:32) - (measurement period time:0.271800767 sec time_interval:271800767) - (invoke count:10000000 tsc_interval:813847530)
[  692.077384] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.526 ns (step:32) - (measurement period time:0.095266005 sec time_interval:95266005) - (invoke count:10000000 tsc_interval:285252780)
[  692.348422] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.102 ns (step:34) - (measurement period time:0.271026511 sec time_interval:271026511) - (invoke count:9999978 tsc_interval:811529490)
[  692.440805] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.236 ns (step:34) - (measurement period time:0.092368535 sec time_interval:92368535) - (invoke count:9999978 tsc_interval:276576810)
[  692.712439] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.162 ns (step:48) - (measurement period time:0.271619761 sec time_interval:271619761) - (invoke count:9999984 tsc_interval:813305970)
[  692.945558] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.309 ns (step:48) - (measurement period time:0.233091977 sec time_interval:233091977) - (invoke count:9999984 tsc_interval:697942470)
[  693.234591] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.902 ns (step:64) - (measurement period time:0.289021416 sec time_interval:289021416) - (invoke count:10000000 tsc_interval:865411350)
[  693.326142] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.153 ns (step:64) - (measurement period time:0.091539475 sec time_interval:91539475) - (invoke count:10000000 tsc_interval:274094220)
[  693.615858] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.970 ns (step:128) - (measurement period time:0.289709207 sec time_interval:289709207) - (invoke count:10000000 tsc_interval:867470400)
[  693.717321] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.145 ns (step:128) - (measurement period time:0.101451019 sec time_interval:101451019) - (invoke count:10000000 tsc_interval:303772410)
[  694.000375] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.304 ns (step:158) - (measurement period time:0.283047625 sec time_interval:283047625) - (invoke count:9999978 tsc_interval:847523850)
[  694.108588] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.816 ns (step:158) - (measurement period time:0.108168257 sec time_interval:108168257) - (invoke count:9999978 tsc_interval:323885820)
[  694.392070] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.344 ns (step:250) - (measurement period time:0.283447055 sec time_interval:283447055) - (invoke count:10000000 tsc_interval:848719800)
[  694.655226] time_bench: Type:kmem bulk_quick_reuse Per elem: 78 cycles(tsc) 26.312 ns (step:250) - (measurement period time:0.263123465 sec time_interval:263123465) - (invoke count:10000000 tsc_interval:787864230)


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-18  0:35     ` Roman Gushchin
@ 2020-06-18  7:33       ` Vlastimil Babka
  2020-06-18 19:54         ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Vlastimil Babka @ 2020-06-18  7:33 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: Christoph Lameter, Johannes Weiner, Michal Hocko, Shakeel Butt,
	linux-mm, kernel-team, linux-kernel, Kees Cook

On 6/18/20 2:35 AM, Roman Gushchin wrote:
> On Wed, Jun 17, 2020 at 04:35:28PM -0700, Andrew Morton wrote:
>> On Mon, 8 Jun 2020 16:06:52 -0700 Roman Gushchin <guro@fb.com> wrote:
>> 
>> > Instead of having two sets of kmem_caches: one for system-wide and
>> > non-accounted allocations and the second one shared by all accounted
>> > allocations, we can use just one.
>> > 
>> > The idea is simple: space for obj_cgroup metadata can be allocated
>> > on demand and filled only for accounted allocations.
>> > 
>> > It allows to remove a bunch of code which is required to handle
>> > kmem_cache clones for accounted allocations. There is no more need
>> > to create them, accumulate statistics, propagate attributes, etc.
>> > It's a quite significant simplification.
>> > 
>> > Also, because the total number of slab_caches is reduced almost twice
>> > (not all kmem_caches have a memcg clone), some additional memory
>> > savings are expected. On my devvm it additionally saves about 3.5%
>> > of slab memory.
>> > 
>> 
>> This ran afoul of Vlastimil's "mm, slab/slub: move and improve
>> cache_from_obj()"
>> (http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz).  I
>> resolved things as below.  Not too sure about slab.c's
>> cache_from_obj()...
> 
> It can actually be as simple as:
> static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
> {
> 	return s;
> }
> 
> But I wonder if we need it at all, or maybe we wanna rename it to
> something like obj_check_kmem_cache(void *obj, struct kmem_cache *s),
> because it has now only debug purposes.
> 
> Let me and Vlastimil figure it out and send a follow-up patch.
> Your version is definitely correct.

Well, Kees wants to restore the common version of cache_from_obj() [1] for SLAB
hardening.

To prevent all that back-and-forth churn from entering git history, I think the
best approach is for me to send a -fix to my patch that is functionally the same
while keeping the common function. Then this patch of yours should only have a
minor conflict, and Kees can rebase his patches on top so they become much smaller?

[1] https://lore.kernel.org/linux-mm/20200617195349.3471794-1-keescook@chromium.org/

> Thanks!
> 



* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-18  1:29           ` Roman Gushchin
@ 2020-06-18  8:43             ` Jesper Dangaard Brouer
  2020-06-18  9:31               ` Jesper Dangaard Brouer
  2020-06-19  1:27               ` Roman Gushchin
  0 siblings, 2 replies; 92+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-18  8:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman, brouer

On Wed, 17 Jun 2020 18:29:28 -0700
Roman Gushchin <guro@fb.com> wrote:

> On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> > On 6/17/20 5:32 AM, Roman Gushchin wrote:  
> > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:  
> > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:  
> > >> >
> > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:  
> > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:  
> > >> > > >  
> > >> [...]  
> > >> > >
> > >> > > Have you performed any [perf] testing on SLAB with this patchset?  
> > >> >
> > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > >> > significant difference. I've checked that it compiles, boots and passes
> > >> > kselftests. And that memory savings are there.
> > >> >  
> > >> 
> > >> What about performance? Also you mentioned that sharing kmem-cache
> > >> between accounted and non-accounted can have additional overhead. Any
> > >> difference between SLAB and SLUB for such a case?  
> > > 
> > > Not really.
> > > 
> > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > allocations, which is something I've tried hard to avoid in my original version.
> > > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > it's hard to argue with Johannes, who pushed on this design.
> > > 
> > > With performance testing it's not that easy, because it's not obvious what
> > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > measuring something like 1000000 allocations and deallocations in a line from
> > > a single kmem_cache will show a regression. But in the real world the relative
> > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > working set and from having shared kmem_cache objects cache hot.
> > > Not speaking about some extra memory and the fragmentation reduction.
> > > 
> > > We've done an extensive testing of the original version in Facebook production,
> > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > using an original version with two sets of kmem_caches.
> > > 
> > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > can help with the performance evaluation, I'll appreciate it a lot.  
> > 
> > Jesper provided some pointers here [1], it would be really great if you could
> > run at least those microbenchmarks. With mmtests it's the major question of
> > which subset/profiles to run, maybe the referenced commits provide some hints,
> > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > 
> > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/  
> 
> Oh, Jesper, I'm really sorry, somehow I missed your mail.
> Thank you, Vlastimil, for pointing at it.
> 
> I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> 
> I ran original vs patched with SLUB and SLAB, each test several times and picked
> 3 which looked most consistently. But it still looks very noisy.
> 
> I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).

What about running these tests on the server-level hardware that you
intend to run this on?

> 
> How should I interpret this data?

First of all, these SLUB+SLAB microbenchmarks use an object size of 256 bytes,
because the network stack allocates objects of this size for SKBs/sk_buff (the
used size is 224 bytes, which becomes 256 due to cache alignment). I checked
SLUB: each slab uses 2 pages (8192 bytes) and contains 32 objects of size 256
(256*32=8192).
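
The slab geometry can be re-derived with a few lines of arithmetic. This is a
minimal sketch, assuming a 64-byte cache line (an assumption for this x86
machine, not stated in the thread) and 4096-byte pages (which follows from
"2 pages (8192 bytes)"). The helper names are illustrative and are not kernel
APIs, and the model ignores any per-object debug metadata.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes (typical for x86)
PAGE_SIZE = 4096  # 2 pages == 8192 bytes, as stated in the mail


def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def objects_per_slab(object_size, pages):
    """Number of whole objects that fit into a slab of the given page count."""
    return (pages * PAGE_SIZE) // object_size


obj_size = align_up(224, CACHE_LINE)      # used size 224, cache-aligned
per_slab = objects_per_slab(obj_size, 2)  # slab of 2 pages
print(obj_size, per_slab)
```

Under these assumptions the script reproduces the 256-byte object size and the
32 objects per slab quoted in the text.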

The SLUB allocator has a per-CPU slab which speeds up fast reuse, in this
case of up to 32 objects. For SLUB, the "fastpath reuse" test exercises this
behaviour, and it serves as a baseline for optimal 1-object performance (which
my bulk API tries to beat; that is possible even for 1 object, because the bulk
API knows it cannot be used from IRQ context).

SLUB fastpath: 3 measurements reporting cycles(tsc)
 - SLUB-patched : fastpath reuse: 184 - 177 - 176  cycles(tsc)
 - SLUB-original: fastpath reuse: 178 - 153 - 156  cycles(tsc)

There are some stability concerns, as you mention, but the patched version
seems to be pretty consistently slower. If you compile with no-PREEMPT you can
likely get more stable results (and remove a slight overhead for the SLUB
fastpath).
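
To put a rough number on "slower", the three fastpath samples per kernel can be
summarised and compared. This is only a sketch: using the median and the
minimum as summary statistics is my choice, not something the benchmark
prescribes, and three noisy samples do not support a precise figure.

```python
from statistics import median

# cycles(tsc) per element for "kmem fastpath reuse", copied from the mail
patched = [184, 177, 176]
original = [178, 153, 156]


def relative_slowdown(new, old, pick=median):
    """Relative slowdown of new vs. old using the chosen summary statistic."""
    return pick(new) / pick(old) - 1.0


print(f"median: {relative_slowdown(patched, original):+.1%}")
print(f"min:    {relative_slowdown(patched, original, pick=min):+.1%}")
```

Both summaries point the same way (patched slower), which matches the reading
of the raw numbers; the minimum is often the more robust choice for
microbenchmarks, since noise tends to add cycles rather than remove them.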

The microbenchmark also measures the bulk API, which AFAIK is only used
by the network stack (and io_uring). I guess you shouldn't focus too much
on these bulk measurements. When a bulk request crosses the objects-per-slab
threshold, or is unlucky and uses two per-CPU slabs, the measurements can
fluctuate a bit.
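
As a rough illustration of the objects-per-slab threshold, the sketch below
computes the fewest slabs each bulk size ("step") in the quoted runs could fit
into. It is a deliberately simplified best-case model, assuming 32 objects per
slab and completely empty slabs; a real request can straddle more slabs when
the current per-CPU slab is partially used.

```python
import math

OBJECTS_PER_SLAB = 32  # 8192-byte slab / 256-byte objects, from the mail

# bulk sizes ("step") that appear in the time_bench output in this thread
STEPS = [1, 2, 3, 4, 8, 16, 30, 32, 34, 48, 64, 128, 158, 250]


def min_slabs(bulk, per_slab=OBJECTS_PER_SLAB):
    """Fewest slabs that can hold `bulk` objects (best case, empty slabs)."""
    return math.ceil(bulk / per_slab)


for step in STEPS:
    print(f"step={step:3d} -> at least {min_slabs(step)} slab(s)")
```

Steps up to 32 can in the best case be served from a single slab, while 34 and
above always need at least two.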

Your numbers for SLUB bulk-API:

SLUB-patched - bulk-API
 - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
 - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
 - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
 - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
 - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)

SLUB-original -  bulk-API
 - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
 - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
 - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
 - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
 - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)

Maybe it is just noise or instability in the measurements, but it seems that the
1-object case is consistently slower in your patched version.
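
The per-element figures in these tables come from the time_bench lines in the
quoted logs. Below is a small sketch of pulling the fields out of one such
line. The regular expression is my own guess at the line format based on the
output in this thread, not something taken from the time_bench source, and the
interpretation of the fields in the comments is likewise an assumption.

```python
import re

# One line copied verbatim from the quoted SLUB "Patched" log in this mail.
LINE = (
    "[  446.329396] time_bench: Type:kmem bulk_quick_reuse Per elem: 187 "
    "cycles(tsc) 62.554 ns (step:1) - (measurement period time:0.625541838 sec "
    "time_interval:625541838) - (invoke count:10000000 tsc_interval:1873003020)"
)

PATTERN = re.compile(
    r"Type:(?P<type>.+?) Per elem: (?P<cycles>\d+) cycles\(tsc\) "
    r"(?P<ns>[\d.]+) ns \(step:(?P<step>\d+)\).*"
    r"time_interval:(?P<time_ns>\d+)\).*"
    r"invoke count:(?P<count>\d+) tsc_interval:(?P<tsc>\d+)\)"
)


def parse_time_bench(line):
    """Return the fields of one time_bench line as a dict, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rec = {"type": m["type"], "ns": float(m["ns"])}
    for key in ("cycles", "step", "time_ns", "count", "tsc"):
        rec[key] = int(m[key])
    return rec


rec = parse_time_bench(LINE)
# Reported cycles appear to be tsc_interval / invoke count, truncated.
print(rec["type"], rec["step"], rec["cycles"], rec["tsc"] // rec["count"])
# Nanoseconds per element and the implied TSC frequency in GHz.
print(rec["time_ns"] / rec["count"], rec["tsc"] / rec["time_ns"])
```

Feeding every log line through parse_time_bench() and grouping by type and step
would rebuild the tables without copying numbers by hand.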

Mail is too long now... I'll take a look at your SLAB results and follow up.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

 
> --
> 
> SLUB:
> 
> Patched:
> [  444.395174] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.773 ns (step:0) - (measurement period time:0.077335091 sec time_interval:77335091) - (invoke count:100000000 tsc_interval:231555960)
> [  445.012669] time_bench: Type:kmem fastpath reuse Per elem: 184 cycles(tsc) 61.747 ns (step:0) - (measurement period time:0.617475365 sec time_interval:617475365) - (invoke count:10000000 tsc_interval:1848850440)
> [  445.703843] time_bench: Type:kmem bulk_fallback Per elem: 206 cycles(tsc) 69.115 ns (step:1) - (measurement period time:0.691150675 sec time_interval:691150675) - (invoke count:10000000 tsc_interval:2069450250)
> [  446.329396] time_bench: Type:kmem bulk_quick_reuse Per elem: 187 cycles(tsc) 62.554 ns (step:1) - (measurement period time:0.625541838 sec time_interval:625541838) - (invoke count:10000000 tsc_interval:1873003020)
> [  446.975616] time_bench: Type:kmem bulk_fallback Per elem: 193 cycles(tsc) 64.622 ns (step:2) - (measurement period time:0.646223732 sec time_interval:646223732) - (invoke count:10000000 tsc_interval:1934929440)
> [  447.345512] time_bench: Type:kmem bulk_quick_reuse Per elem: 110 cycles(tsc) 36.988 ns (step:2) - (measurement period time:0.369885352 sec time_interval:369885352) - (invoke count:10000000 tsc_interval:1107514050)
> [  447.986272] time_bench: Type:kmem bulk_fallback Per elem: 191 cycles(tsc) 64.075 ns (step:3) - (measurement period time:0.640756304 sec time_interval:640756304) - (invoke count:9999999 tsc_interval:1918559070)
> [  448.282163] time_bench: Type:kmem bulk_quick_reuse Per elem: 88 cycles(tsc) 29.586 ns (step:3) - (measurement period time:0.295866328 sec time_interval:295866328) - (invoke count:9999999 tsc_interval:885885270)
> [  448.623183] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.100 ns (step:4) - (measurement period time:0.341005290 sec time_interval:341005290) - (invoke count:10000000 tsc_interval:1021040820)
> [  448.930228] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.702 ns (step:4) - (measurement period time:0.307020500 sec time_interval:307020500) - (invoke count:10000000 tsc_interval:919282860)
> [  449.739697] time_bench: Type:kmem bulk_fallback Per elem: 242 cycles(tsc) 80.946 ns (step:8) - (measurement period time:0.809465825 sec time_interval:809465825) - (invoke count:10000000 tsc_interval:2423710560)
> [  449.848110] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.836 ns (step:8) - (measurement period time:0.108363638 sec time_interval:108363638) - (invoke count:10000000 tsc_interval:324462540)
> [  450.617892] time_bench: Type:kmem bulk_fallback Per elem: 230 cycles(tsc) 76.978 ns (step:16) - (measurement period time:0.769783892 sec time_interval:769783892) - (invoke count:10000000 tsc_interval:2304894090)
> [  450.719556] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.164 ns (step:16) - (measurement period time:0.101645837 sec time_interval:101645837) - (invoke count:10000000 tsc_interval:304348440)
> [  451.025387] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.580 ns (step:30) - (measurement period time:0.305803321 sec time_interval:305803321) - (invoke count:9999990 tsc_interval:915639450)
> [  451.277708] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.229 ns (step:30) - (measurement period time:0.252294821 sec time_interval:252294821) - (invoke count:9999990 tsc_interval:755422110)
> [  451.709305] time_bench: Type:kmem bulk_fallback Per elem: 129 cycles(tsc) 43.158 ns (step:32) - (measurement period time:0.431581619 sec time_interval:431581619) - (invoke count:10000000 tsc_interval:1292245320)
> [  451.810686] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.135 ns (step:32) - (measurement period time:0.101357841 sec time_interval:101357841) - (invoke count:10000000 tsc_interval:303485250)
> [  452.186138] time_bench: Type:kmem bulk_fallback Per elem: 112 cycles(tsc) 37.545 ns (step:34) - (measurement period time:0.375453243 sec time_interval:375453243) - (invoke count:9999978 tsc_interval:1124185320)
> [  452.304950] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.880 ns (step:34) - (measurement period time:0.118800736 sec time_interval:118800736) - (invoke count:9999978 tsc_interval:355713360)
> [  452.658607] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.362 ns (step:48) - (measurement period time:0.353623065 sec time_interval:353623065) - (invoke count:9999984 tsc_interval:1058820960)
> [  452.891623] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.298 ns (step:48) - (measurement period time:0.232988291 sec time_interval:232988291) - (invoke count:9999984 tsc_interval:697614570)
> [  453.237406] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.578 ns (step:64) - (measurement period time:0.345780444 sec time_interval:345780444) - (invoke count:10000000 tsc_interval:1035338790)
> [  453.344946] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.750 ns (step:64) - (measurement period time:0.107500964 sec time_interval:107500964) - (invoke count:10000000 tsc_interval:321880290)
> [  454.249297] time_bench: Type:kmem bulk_fallback Per elem: 270 cycles(tsc) 90.434 ns (step:128) - (measurement period time:0.904340126 sec time_interval:904340126) - (invoke count:10000000 tsc_interval:2707784610)
> [  454.582548] time_bench: Type:kmem bulk_quick_reuse Per elem: 99 cycles(tsc) 33.322 ns (step:128) - (measurement period time:0.333226211 sec time_interval:333226211) - (invoke count:10000000 tsc_interval:997748760)
> [  454.965002] time_bench: Type:kmem bulk_fallback Per elem: 114 cycles(tsc) 38.241 ns (step:158) - (measurement period time:0.382415227 sec time_interval:382415227) - (invoke count:9999978 tsc_interval:1145031120)
> [  455.314105] time_bench: Type:kmem bulk_quick_reuse Per elem: 104 cycles(tsc) 34.908 ns (step:158) - (measurement period time:0.349080430 sec time_interval:349080430) - (invoke count:9999978 tsc_interval:1045219530)
> [  455.699089] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.495 ns (step:250) - (measurement period time:0.384953654 sec time_interval:384953654) - (invoke count:10000000 tsc_interval:1152631920)
> [  456.104244] time_bench: Type:kmem bulk_quick_reuse Per elem: 121 cycles(tsc) 40.513 ns (step:250) - (measurement period time:0.405138149 sec time_interval:405138149) - (invoke count:10000000 tsc_interval:1213068180)
> 
> [  465.696654] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077270577 sec time_interval:77270577) - (invoke count:100000000 tsc_interval:231363840)
> [  466.290176] time_bench: Type:kmem fastpath reuse Per elem: 177 cycles(tsc) 59.349 ns (step:0) - (measurement period time:0.593496780 sec time_interval:593496780) - (invoke count:10000000 tsc_interval:1777053420)
> [  466.629838] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 33.965 ns (step:1) - (measurement period time:0.339652351 sec time_interval:339652351) - (invoke count:10000000 tsc_interval:1016989230)
> [  466.933290] time_bench: Type:kmem bulk_quick_reuse Per elem: 90 cycles(tsc) 30.344 ns (step:1) - (measurement period time:0.303444180 sec time_interval:303444180) - (invoke count:10000000 tsc_interval:908575380)
> [  467.250189] time_bench: Type:kmem bulk_fallback Per elem: 94 cycles(tsc) 31.689 ns (step:2) - (measurement period time:0.316896073 sec time_interval:316896073) - (invoke count:10000000 tsc_interval:948853110)
> [  467.430142] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.994 ns (step:2) - (measurement period time:0.179940800 sec time_interval:179940800) - (invoke count:10000000 tsc_interval:538779390)
> [  467.780573] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 35.039 ns (step:3) - (measurement period time:0.350394226 sec time_interval:350394226) - (invoke count:9999999 tsc_interval:1049153580)
> [  468.100301] time_bench: Type:kmem bulk_quick_reuse Per elem: 95 cycles(tsc) 31.970 ns (step:3) - (measurement period time:0.319706687 sec time_interval:319706687) - (invoke count:9999999 tsc_interval:957267660)
> [  468.792650] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.235 ns (step:4) - (measurement period time:0.692354598 sec time_interval:692354598) - (invoke count:10000000 tsc_interval:2073054750)
> [  469.078816] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.614 ns (step:4) - (measurement period time:0.286145162 sec time_interval:286145162) - (invoke count:10000000 tsc_interval:856777710)
> [  469.694558] time_bench: Type:kmem bulk_fallback Per elem: 184 cycles(tsc) 61.573 ns (step:8) - (measurement period time:0.615733224 sec time_interval:615733224) - (invoke count:10000000 tsc_interval:1843634190)
> [  469.917439] time_bench: Type:kmem bulk_quick_reuse Per elem: 66 cycles(tsc) 22.284 ns (step:8) - (measurement period time:0.222848937 sec time_interval:222848937) - (invoke count:10000000 tsc_interval:667255740)
> [  470.586966] time_bench: Type:kmem bulk_fallback Per elem: 200 cycles(tsc) 66.952 ns (step:16) - (measurement period time:0.669526473 sec time_interval:669526473) - (invoke count:10000000 tsc_interval:2004702960)
> [  470.794012] time_bench: Type:kmem bulk_quick_reuse Per elem: 61 cycles(tsc) 20.697 ns (step:16) - (measurement period time:0.206972335 sec time_interval:206972335) - (invoke count:10000000 tsc_interval:619717170)
> [  471.422674] time_bench: Type:kmem bulk_fallback Per elem: 188 cycles(tsc) 62.866 ns (step:30) - (measurement period time:0.628659634 sec time_interval:628659634) - (invoke count:9999990 tsc_interval:1882338990)
> [  471.524193] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.149 ns (step:30) - (measurement period time:0.101497972 sec time_interval:101497972) - (invoke count:9999990 tsc_interval:303905340)
> [  471.829474] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.527 ns (step:32) - (measurement period time:0.305271485 sec time_interval:305271485) - (invoke count:10000000 tsc_interval:914046510)
> [  471.930490] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.099 ns (step:32) - (measurement period time:0.100992877 sec time_interval:100992877) - (invoke count:10000000 tsc_interval:302392890)
> [  472.311211] time_bench: Type:kmem bulk_fallback Per elem: 113 cycles(tsc) 38.072 ns (step:34) - (measurement period time:0.380725777 sec time_interval:380725777) - (invoke count:9999978 tsc_interval:1139972850)
> [  472.429823] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.860 ns (step:34) - (measurement period time:0.118599617 sec time_interval:118599617) - (invoke count:9999978 tsc_interval:355111890)
> [  472.890092] time_bench: Type:kmem bulk_fallback Per elem: 137 cycles(tsc) 46.026 ns (step:48) - (measurement period time:0.460264730 sec time_interval:460264730) - (invoke count:9999984 tsc_interval:1378127970)
> [  472.999481] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.937 ns (step:48) - (measurement period time:0.109371593 sec time_interval:109371593) - (invoke count:9999984 tsc_interval:327480390)
> [  473.344109] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.462 ns (step:64) - (measurement period time:0.344629774 sec time_interval:344629774) - (invoke count:10000000 tsc_interval:1031893740)
> [  473.452099] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.794 ns (step:64) - (measurement period time:0.107942846 sec time_interval:107942846) - (invoke count:10000000 tsc_interval:323202390)
> [  474.382899] time_bench: Type:kmem bulk_fallback Per elem: 278 cycles(tsc) 93.080 ns (step:128) - (measurement period time:0.930809025 sec time_interval:930809025) - (invoke count:10000000 tsc_interval:2787037260)
> [  474.729757] time_bench: Type:kmem bulk_quick_reuse Per elem: 103 cycles(tsc) 34.683 ns (step:128) - (measurement period time:0.346831572 sec time_interval:346831572) - (invoke count:10000000 tsc_interval:1038484980)
> [  475.616707] time_bench: Type:kmem bulk_fallback Per elem: 265 cycles(tsc) 88.693 ns (step:158) - (measurement period time:0.886937188 sec time_interval:886937188) - (invoke count:9999978 tsc_interval:2655675660)
> [  475.890425] time_bench: Type:kmem bulk_quick_reuse Per elem: 81 cycles(tsc) 27.369 ns (step:158) - (measurement period time:0.273692416 sec time_interval:273692416) - (invoke count:9999978 tsc_interval:819491040)
> [  476.275144] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.471 ns (step:250) - (measurement period time:0.384713160 sec time_interval:384713160) - (invoke count:10000000 tsc_interval:1151911110)
> [  476.424219] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.906 ns (step:250) - (measurement period time:0.149068364 sec time_interval:149068364) - (invoke count:10000000 tsc_interval:446341830)
> 
> [  490.306824] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.776 ns (step:0) - (measurement period time:0.077691991 sec time_interval:77691991) - (invoke count:100000000 tsc_interval:232625850)
> [  490.897035] time_bench: Type:kmem fastpath reuse Per elem: 176 cycles(tsc) 59.019 ns (step:0) - (measurement period time:0.590195120 sec time_interval:590195120) - (invoke count:10000000 tsc_interval:1767174930)
> [  491.590675] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.362 ns (step:1) - (measurement period time:0.693628128 sec time_interval:693628128) - (invoke count:10000000 tsc_interval:2076877050)
> [  492.339461] time_bench: Type:kmem bulk_quick_reuse Per elem: 224 cycles(tsc) 74.877 ns (step:1) - (measurement period time:0.748777171 sec time_interval:748777171) - (invoke count:10000000 tsc_interval:2242005540)
> [  493.129328] time_bench: Type:kmem bulk_fallback Per elem: 236 cycles(tsc) 78.984 ns (step:2) - (measurement period time:0.789848781 sec time_interval:789848781) - (invoke count:10000000 tsc_interval:2364983220)
> [  493.574670] time_bench: Type:kmem bulk_quick_reuse Per elem: 133 cycles(tsc) 44.530 ns (step:2) - (measurement period time:0.445304096 sec time_interval:445304096) - (invoke count:10000000 tsc_interval:1333339110)
> [  493.887021] time_bench: Type:kmem bulk_fallback Per elem: 93 cycles(tsc) 31.231 ns (step:3) - (measurement period time:0.312316389 sec time_interval:312316389) - (invoke count:9999999 tsc_interval:935143950)
> [  494.029383] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.234 ns (step:3) - (measurement period time:0.142346254 sec time_interval:142346254) - (invoke count:9999999 tsc_interval:426216000)
> [  494.369892] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 34.050 ns (step:4) - (measurement period time:0.340504527 sec time_interval:340504527) - (invoke count:10000000 tsc_interval:1019546130)
> [  494.493217] time_bench: Type:kmem bulk_quick_reuse Per elem: 36 cycles(tsc) 12.329 ns (step:4) - (measurement period time:0.123294475 sec time_interval:123294475) - (invoke count:10000000 tsc_interval:369169800)
> [  494.820003] time_bench: Type:kmem bulk_fallback Per elem: 97 cycles(tsc) 32.678 ns (step:8) - (measurement period time:0.326780876 sec time_interval:326780876) - (invoke count:10000000 tsc_interval:978453960)
> [  494.928831] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.880 ns (step:8) - (measurement period time:0.108808086 sec time_interval:108808086) - (invoke count:10000000 tsc_interval:325794570)
> [  495.684358] time_bench: Type:kmem bulk_fallback Per elem: 226 cycles(tsc) 75.552 ns (step:16) - (measurement period time:0.755527917 sec time_interval:755527917) - (invoke count:10000000 tsc_interval:2262218520)
> [  495.785682] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.130 ns (step:16) - (measurement period time:0.101307607 sec time_interval:101307607) - (invoke count:10000000 tsc_interval:303336720)
> [  496.090994] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.528 ns (step:30) - (measurement period time:0.305280433 sec time_interval:305280433) - (invoke count:9999990 tsc_interval:914077290)
> [  496.341570] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.054 ns (step:30) - (measurement period time:0.250548825 sec time_interval:250548825) - (invoke count:9999990 tsc_interval:750197910)
> [  496.646784] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.518 ns (step:32) - (measurement period time:0.305189218 sec time_interval:305189218) - (invoke count:10000000 tsc_interval:913803540)
> [  496.900311] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.349 ns (step:32) - (measurement period time:0.253499465 sec time_interval:253499465) - (invoke count:10000000 tsc_interval:759033060)
> [  497.778600] time_bench: Type:kmem bulk_fallback Per elem: 262 cycles(tsc) 87.830 ns (step:34) - (measurement period time:0.878298604 sec time_interval:878298604) - (invoke count:9999978 tsc_interval:2629821090)
> [  498.043690] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.506 ns (step:34) - (measurement period time:0.265066374 sec time_interval:265066374) - (invoke count:9999978 tsc_interval:793667400)
> [  498.393912] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 35.021 ns (step:48) - (measurement period time:0.350216735 sec time_interval:350216735) - (invoke count:9999984 tsc_interval:1048626840)
> [  498.504846] time_bench: Type:kmem bulk_quick_reuse Per elem: 33 cycles(tsc) 11.092 ns (step:48) - (measurement period time:0.110924201 sec time_interval:110924201) - (invoke count:9999984 tsc_interval:332131200)
> [  498.878335] time_bench: Type:kmem bulk_fallback Per elem: 111 cycles(tsc) 37.345 ns (step:64) - (measurement period time:0.373454272 sec time_interval:373454272) - (invoke count:10000000 tsc_interval:1118205060)
> [  499.145467] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.710 ns (step:64) - (measurement period time:0.267102714 sec time_interval:267102714) - (invoke count:10000000 tsc_interval:799763910)
> [  499.525255] time_bench: Type:kmem bulk_fallback Per elem: 113 cycles(tsc) 37.971 ns (step:128) - (measurement period time:0.379715035 sec time_interval:379715035) - (invoke count:10000000 tsc_interval:1136951190)
> [  499.852495] time_bench: Type:kmem bulk_quick_reuse Per elem: 97 cycles(tsc) 32.721 ns (step:128) - (measurement period time:0.327218329 sec time_interval:327218329) - (invoke count:10000000 tsc_interval:979763670)
> [  500.238889] time_bench: Type:kmem bulk_fallback Per elem: 115 cycles(tsc) 38.638 ns (step:158) - (measurement period time:0.386388112 sec time_interval:386388112) - (invoke count:9999978 tsc_interval:1156931610)
> [  500.370790] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.189 ns (step:158) - (measurement period time:0.131890805 sec time_interval:131890805) - (invoke count:9999978 tsc_interval:394909920)
> [  500.747241] time_bench: Type:kmem bulk_fallback Per elem: 112 cycles(tsc) 37.645 ns (step:250) - (measurement period time:0.376455749 sec time_interval:376455749) - (invoke count:10000000 tsc_interval:1127192310)
> [  500.897248] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.999 ns (step:250) - (measurement period time:0.149997635 sec time_interval:149997635) - (invoke count:10000000 tsc_interval:449125920)
> 
> Orig:
> [   81.987064] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.813 ns (step:0) - (measurement period time:0.081397445 sec time_interval:81397445) - (invoke count:100000000 tsc_interval:243727920)
> [   82.595831] time_bench: Type:kmem fastpath reuse Per elem: 178 cycles(tsc) 59.675 ns (step:0) - (measurement period time:0.596752095 sec time_interval:596752095) - (invoke count:10000000 tsc_interval:1786857030)
> [   83.031850] time_bench: Type:kmem bulk_fallback Per elem: 127 cycles(tsc) 42.541 ns (step:1) - (measurement period time:0.425415790 sec time_interval:425415790) - (invoke count:10000000 tsc_interval:1273823670)
> [   83.340838] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.200 ns (step:1) - (measurement period time:0.292006301 sec time_interval:292006301) - (invoke count:10000000 tsc_interval:874355190)
> [   83.630781] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.923 ns (step:2) - (measurement period time:0.279231691 sec time_interval:279231691) - (invoke count:10000000 tsc_interval:836104170)
> [   83.821746] time_bench: Type:kmem bulk_quick_reuse Per elem: 52 cycles(tsc) 17.611 ns (step:2) - (measurement period time:0.176116770 sec time_interval:176116770) - (invoke count:10000000 tsc_interval:527346570)
> [   84.105841] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.184 ns (step:3) - (measurement period time:0.271845630 sec time_interval:271845630) - (invoke count:9999999 tsc_interval:813988260)
> [   84.257733] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.120 ns (step:3) - (measurement period time:0.141208965 sec time_interval:141208965) - (invoke count:9999999 tsc_interval:422821890)
> [   84.578730] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.798 ns (step:4) - (measurement period time:0.307982589 sec time_interval:307982589) - (invoke count:10000000 tsc_interval:922193070)
> [   84.894740] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.523 ns (step:4) - (measurement period time:0.305231656 sec time_interval:305231656) - (invoke count:10000000 tsc_interval:913955310)
> [   85.596699] time_bench: Type:kmem bulk_fallback Per elem: 206 cycles(tsc) 68.977 ns (step:8) - (measurement period time:0.689779758 sec time_interval:689779758) - (invoke count:10000000 tsc_interval:2065410030)
> [   85.728679] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.641 ns (step:8) - (measurement period time:0.106415387 sec time_interval:106415387) - (invoke count:10000000 tsc_interval:318639630)
> [   86.016723] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.302 ns (step:16) - (measurement period time:0.273021863 sec time_interval:273021863) - (invoke count:10000000 tsc_interval:817509990)
> [   86.137711] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 10.005 ns (step:16) - (measurement period time:0.100053210 sec time_interval:100053210) - (invoke count:10000000 tsc_interval:299589180)
> [   86.420698] time_bench: Type:kmem bulk_fallback Per elem: 79 cycles(tsc) 26.598 ns (step:30) - (measurement period time:0.265984644 sec time_interval:265984644) - (invoke count:9999990 tsc_interval:796437960)
> [   86.534652] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.742 ns (step:30) - (measurement period time:0.097425391 sec time_interval:97425391) - (invoke count:9999990 tsc_interval:291720810)
> [   86.812682] time_bench: Type:kmem bulk_fallback Per elem: 79 cycles(tsc) 26.522 ns (step:32) - (measurement period time:0.265225864 sec time_interval:265225864) - (invoke count:10000000 tsc_interval:794166360)
> [   86.923650] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.729 ns (step:32) - (measurement period time:0.097294552 sec time_interval:97294552) - (invoke count:10000000 tsc_interval:291328800)
> [   87.255647] time_bench: Type:kmem bulk_fallback Per elem: 95 cycles(tsc) 32.050 ns (step:34) - (measurement period time:0.320499429 sec time_interval:320499429) - (invoke count:9999978 tsc_interval:959672160)
> [   87.383687] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.492 ns (step:34) - (measurement period time:0.114921393 sec time_interval:114921393) - (invoke count:9999978 tsc_interval:344109030)
> [   87.724663] time_bench: Type:kmem bulk_fallback Per elem: 96 cycles(tsc) 32.346 ns (step:48) - (measurement period time:0.323463245 sec time_interval:323463245) - (invoke count:9999984 tsc_interval:968546670)
> [   87.847640] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.661 ns (step:48) - (measurement period time:0.106610938 sec time_interval:106610938) - (invoke count:9999984 tsc_interval:319225170)
> [   88.167636] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.678 ns (step:64) - (measurement period time:0.306781428 sec time_interval:306781428) - (invoke count:10000000 tsc_interval:918596670)
> [   88.287645] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.677 ns (step:64) - (measurement period time:0.106773747 sec time_interval:106773747) - (invoke count:10000000 tsc_interval:319712640)
> [   88.634627] time_bench: Type:kmem bulk_fallback Per elem: 100 cycles(tsc) 33.591 ns (step:128) - (measurement period time:0.335914141 sec time_interval:335914141) - (invoke count:10000000 tsc_interval:1005828930)
> [   88.785630] time_bench: Type:kmem bulk_quick_reuse Per elem: 40 cycles(tsc) 13.648 ns (step:128) - (measurement period time:0.136483174 sec time_interval:136483174) - (invoke count:10000000 tsc_interval:408671550)
> [   89.138604] time_bench: Type:kmem bulk_fallback Per elem: 101 cycles(tsc) 33.981 ns (step:158) - (measurement period time:0.339814415 sec time_interval:339814415) - (invoke count:9999978 tsc_interval:1017507030)
> [   89.289633] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.110 ns (step:158) - (measurement period time:0.141101621 sec time_interval:141101621) - (invoke count:9999978 tsc_interval:422500530)
> [   89.650638] time_bench: Type:kmem bulk_fallback Per elem: 104 cycles(tsc) 34.887 ns (step:250) - (measurement period time:0.348876887 sec time_interval:348876887) - (invoke count:10000000 tsc_interval:1044643320)
> [   89.813613] time_bench: Type:kmem bulk_quick_reuse Per elem: 44 cycles(tsc) 14.821 ns (step:250) - (measurement period time:0.148213151 sec time_interval:148213151) - (invoke count:10000000 tsc_interval:443794860)
> 
> [  120.495694] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.777 ns (step:0) - (measurement period time:0.077764814 sec time_interval:77764814) - (invoke count:100000000 tsc_interval:232850730)
> [  121.018849] time_bench: Type:kmem fastpath reuse Per elem: 153 cycles(tsc) 51.274 ns (step:0) - (measurement period time:0.512740018 sec time_interval:512740018) - (invoke count:10000000 tsc_interval:1535297070)
> [  121.326965] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.560 ns (step:1) - (measurement period time:0.305608844 sec time_interval:305608844) - (invoke count:10000000 tsc_interval:915084480)
> [  121.628922] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.218 ns (step:1) - (measurement period time:0.292184439 sec time_interval:292184439) - (invoke count:10000000 tsc_interval:874887840)
> [  122.337817] time_bench: Type:kmem bulk_fallback Per elem: 207 cycles(tsc) 69.361 ns (step:2) - (measurement period time:0.693612284 sec time_interval:693612284) - (invoke count:10000000 tsc_interval:2076883890)
> [  122.520912] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.741 ns (step:2) - (measurement period time:0.177417675 sec time_interval:177417675) - (invoke count:10000000 tsc_interval:531240870)
> [  122.872912] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.212 ns (step:3) - (measurement period time:0.342120142 sec time_interval:342120142) - (invoke count:9999999 tsc_interval:1024409910)
> [  123.019909] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.084 ns (step:3) - (measurement period time:0.140842225 sec time_interval:140842225) - (invoke count:9999999 tsc_interval:421723650)
> [  123.837965] time_bench: Type:kmem bulk_fallback Per elem: 241 cycles(tsc) 80.516 ns (step:4) - (measurement period time:0.805161046 sec time_interval:805161046) - (invoke count:10000000 tsc_interval:2410894650)
> [  123.973915] time_bench: Type:kmem bulk_quick_reuse Per elem: 37 cycles(tsc) 12.377 ns (step:4) - (measurement period time:0.123773940 sec time_interval:123773940) - (invoke count:10000000 tsc_interval:370615290)
> [  124.273862] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.860 ns (step:8) - (measurement period time:0.288604912 sec time_interval:288604912) - (invoke count:10000000 tsc_interval:864169920)
> [  124.546757] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.420 ns (step:8) - (measurement period time:0.264207028 sec time_interval:264207028) - (invoke count:10000000 tsc_interval:791114430)
> [  125.191730] time_bench: Type:kmem bulk_fallback Per elem: 190 cycles(tsc) 63.456 ns (step:16) - (measurement period time:0.634568513 sec time_interval:634568513) - (invoke count:10000000 tsc_interval:1900088820)
> [  125.296839] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.043 ns (step:16) - (measurement period time:0.100439926 sec time_interval:100439926) - (invoke count:10000000 tsc_interval:300746670)
> [  125.580743] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.347 ns (step:30) - (measurement period time:0.273471271 sec time_interval:273471271) - (invoke count:9999990 tsc_interval:818855040)
> [  125.836734] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.372 ns (step:30) - (measurement period time:0.243727806 sec time_interval:243727806) - (invoke count:9999990 tsc_interval:729793590)
> [  126.508883] time_bench: Type:kmem bulk_fallback Per elem: 197 cycles(tsc) 65.900 ns (step:32) - (measurement period time:0.659009779 sec time_interval:659009779) - (invoke count:10000000 tsc_interval:1973273460)
> [  126.612891] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.749 ns (step:32) - (measurement period time:0.097491968 sec time_interval:97491968) - (invoke count:10000000 tsc_interval:291919890)
> [  126.968798] time_bench: Type:kmem bulk_fallback Per elem: 103 cycles(tsc) 34.676 ns (step:34) - (measurement period time:0.346762028 sec time_interval:346762028) - (invoke count:9999978 tsc_interval:1038309510)
> [  127.095700] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.648 ns (step:34) - (measurement period time:0.116483925 sec time_interval:116483925) - (invoke count:9999978 tsc_interval:348787590)
> [  127.974794] time_bench: Type:kmem bulk_fallback Per elem: 259 cycles(tsc) 86.651 ns (step:48) - (measurement period time:0.866514663 sec time_interval:866514663) - (invoke count:9999984 tsc_interval:2594605770)
> [  128.093772] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.426 ns (step:48) - (measurement period time:0.114267827 sec time_interval:114267827) - (invoke count:9999984 tsc_interval:342151620)
> [  128.430665] time_bench: Type:kmem bulk_fallback Per elem: 97 cycles(tsc) 32.514 ns (step:64) - (measurement period time:0.325148101 sec time_interval:325148101) - (invoke count:10000000 tsc_interval:973590990)
> [  128.546857] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.991 ns (step:64) - (measurement period time:0.109916673 sec time_interval:109916673) - (invoke count:10000000 tsc_interval:329123280)
> [  129.431645] time_bench: Type:kmem bulk_fallback Per elem: 261 cycles(tsc) 87.191 ns (step:128) - (measurement period time:0.871911323 sec time_interval:871911323) - (invoke count:10000000 tsc_interval:2610764490)
> [  129.583764] time_bench: Type:kmem bulk_quick_reuse Per elem: 43 cycles(tsc) 14.514 ns (step:128) - (measurement period time:0.145148532 sec time_interval:145148532) - (invoke count:10000000 tsc_interval:434617800)
> [  130.443627] time_bench: Type:kmem bulk_fallback Per elem: 254 cycles(tsc) 84.982 ns (step:158) - (measurement period time:0.849826310 sec time_interval:849826310) - (invoke count:9999978 tsc_interval:2544635760)
> [  130.583738] time_bench: Type:kmem bulk_quick_reuse Per elem: 40 cycles(tsc) 13.399 ns (step:158) - (measurement period time:0.133992977 sec time_interval:133992977) - (invoke count:9999978 tsc_interval:401214210)
> [  130.947634] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.206 ns (step:250) - (measurement period time:0.352068766 sec time_interval:352068766) - (invoke count:10000000 tsc_interval:1054199400)
> [  131.268601] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.142 ns (step:250) - (measurement period time:0.311429067 sec time_interval:311429067) - (invoke count:10000000 tsc_interval:932511270)
> 
> [  135.584335] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077217374 sec time_interval:77217374) - (invoke count:100000000 tsc_interval:231211500)
> [  136.122480] time_bench: Type:kmem fastpath reuse Per elem: 156 cycles(tsc) 52.212 ns (step:0) - (measurement period time:0.522120964 sec time_interval:522120964) - (invoke count:10000000 tsc_interval:1563386670)
> [  136.762465] time_bench: Type:kmem bulk_fallback Per elem: 186 cycles(tsc) 62.301 ns (step:1) - (measurement period time:0.623010984 sec time_interval:623010984) - (invoke count:10000000 tsc_interval:1865481540)
> [  137.248444] time_bench: Type:kmem bulk_quick_reuse Per elem: 142 cycles(tsc) 47.606 ns (step:1) - (measurement period time:0.476063536 sec time_interval:476063536) - (invoke count:10000000 tsc_interval:1425477150)
> [  137.540440] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.282 ns (step:2) - (measurement period time:0.282824344 sec time_interval:282824344) - (invoke count:10000000 tsc_interval:846861210)
> [  137.724456] time_bench: Type:kmem bulk_quick_reuse Per elem: 53 cycles(tsc) 17.830 ns (step:2) - (measurement period time:0.178304559 sec time_interval:178304559) - (invoke count:10000000 tsc_interval:533896980)
> [  138.366442] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.289 ns (step:3) - (measurement period time:0.632890657 sec time_interval:632890657) - (invoke count:9999999 tsc_interval:1895064930)
> [  138.682405] time_bench: Type:kmem bulk_quick_reuse Per elem: 91 cycles(tsc) 30.603 ns (step:3) - (measurement period time:0.306034382 sec time_interval:306034382) - (invoke count:9999999 tsc_interval:916357950)
> [  138.997539] time_bench: Type:kmem bulk_fallback Per elem: 90 cycles(tsc) 30.372 ns (step:4) - (measurement period time:0.303723704 sec time_interval:303723704) - (invoke count:10000000 tsc_interval:909440220)
> [  139.131400] time_bench: Type:kmem bulk_quick_reuse Per elem: 37 cycles(tsc) 12.405 ns (step:4) - (measurement period time:0.124058230 sec time_interval:124058230) - (invoke count:10000000 tsc_interval:371467110)
> [  139.430407] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.867 ns (step:8) - (measurement period time:0.288673242 sec time_interval:288673242) - (invoke count:10000000 tsc_interval:864374550)
> [  139.694401] time_bench: Type:kmem bulk_quick_reuse Per elem: 76 cycles(tsc) 25.593 ns (step:8) - (measurement period time:0.255935939 sec time_interval:255935939) - (invoke count:10000000 tsc_interval:766348440)
> [  140.387369] time_bench: Type:kmem bulk_fallback Per elem: 203 cycles(tsc) 68.061 ns (step:16) - (measurement period time:0.680610963 sec time_interval:680610963) - (invoke count:10000000 tsc_interval:2037954090)
> [  140.495385] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.173 ns (step:16) - (measurement period time:0.101737300 sec time_interval:101737300) - (invoke count:10000000 tsc_interval:304631430)
> [  141.101479] time_bench: Type:kmem bulk_fallback Per elem: 177 cycles(tsc) 59.116 ns (step:30) - (measurement period time:0.591165326 sec time_interval:591165326) - (invoke count:9999990 tsc_interval:1770126360)
> [  141.350337] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.305 ns (step:30) - (measurement period time:0.243051460 sec time_interval:243051460) - (invoke count:9999990 tsc_interval:727767660)
> [  141.781369] time_bench: Type:kmem bulk_fallback Per elem: 126 cycles(tsc) 42.191 ns (step:32) - (measurement period time:0.421915112 sec time_interval:421915112) - (invoke count:10000000 tsc_interval:1263340320)
> [  142.029348] time_bench: Type:kmem bulk_quick_reuse Per elem: 72 cycles(tsc) 24.208 ns (step:32) - (measurement period time:0.242082250 sec time_interval:242082250) - (invoke count:10000000 tsc_interval:724865610)
> [  142.833301] time_bench: Type:kmem bulk_fallback Per elem: 237 cycles(tsc) 79.313 ns (step:34) - (measurement period time:0.793128746 sec time_interval:793128746) - (invoke count:9999978 tsc_interval:2374865760)
> [  142.957327] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.796 ns (step:34) - (measurement period time:0.117960158 sec time_interval:117960158) - (invoke count:9999978 tsc_interval:353207850)
> [  143.714486] time_bench: Type:kmem bulk_fallback Per elem: 223 cycles(tsc) 74.629 ns (step:48) - (measurement period time:0.746296426 sec time_interval:746296426) - (invoke count:9999984 tsc_interval:2234635890)
> [  143.998413] time_bench: Type:kmem bulk_quick_reuse Per elem: 82 cycles(tsc) 27.476 ns (step:48) - (measurement period time:0.274759868 sec time_interval:274759868) - (invoke count:9999984 tsc_interval:822712920)
> [  144.717341] time_bench: Type:kmem bulk_fallback Per elem: 211 cycles(tsc) 70.598 ns (step:64) - (measurement period time:0.705984861 sec time_interval:705984861) - (invoke count:10000000 tsc_interval:2113930770)
> [  144.838259] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.788 ns (step:64) - (measurement period time:0.107887319 sec time_interval:107887319) - (invoke count:10000000 tsc_interval:323046420)
> [  145.190386] time_bench: Type:kmem bulk_fallback Per elem: 102 cycles(tsc) 34.174 ns (step:128) - (measurement period time:0.341741874 sec time_interval:341741874) - (invoke count:10000000 tsc_interval:1023278130)
> [  145.514275] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.128 ns (step:128) - (measurement period time:0.311288149 sec time_interval:311288149) - (invoke count:10000000 tsc_interval:932088960)
> [  146.367413] time_bench: Type:kmem bulk_fallback Per elem: 251 cycles(tsc) 84.015 ns (step:158) - (measurement period time:0.840153692 sec time_interval:840153692) - (invoke count:9999978 tsc_interval:2515672920)
> [  146.523219] time_bench: Type:kmem bulk_quick_reuse Per elem: 42 cycles(tsc) 14.280 ns (step:158) - (measurement period time:0.142806094 sec time_interval:142806094) - (invoke count:9999978 tsc_interval:427603830)
> [  146.888375] time_bench: Type:kmem bulk_fallback Per elem: 105 cycles(tsc) 35.119 ns (step:250) - (measurement period time:0.351191259 sec time_interval:351191259) - (invoke count:10000000 tsc_interval:1051571610)
> [  147.291226] time_bench: Type:kmem bulk_quick_reuse Per elem: 117 cycles(tsc) 39.200 ns (step:250) - (measurement period time:0.392003176 sec time_interval:392003176) - (invoke count:10000000 tsc_interval:1173774360)
> 
> 
> SLAB:
> 
> Orig:
> [   80.499545] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.830 ns (step:0) - (measurement period time:0.083085912 sec time_interval:83085912) - (invoke count:100000000 tsc_interval:248781840)
> [   81.099911] time_bench: Type:kmem fastpath reuse Per elem: 174 cycles(tsc) 58.430 ns (step:0) - (measurement period time:0.584308185 sec time_interval:584308185) - (invoke count:10000000 tsc_interval:1749584790)
> [   81.421881] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 30.019 ns (step:1) - (measurement period time:0.300198661 sec time_interval:300198661) - (invoke count:10000000 tsc_interval:898879710)
> [   81.910960] time_bench: Type:kmem bulk_quick_reuse Per elem: 143 cycles(tsc) 47.889 ns (step:1) - (measurement period time:0.478893310 sec time_interval:478893310) - (invoke count:10000000 tsc_interval:1433941530)
> [   82.583917] time_bench: Type:kmem bulk_fallback Per elem: 197 cycles(tsc) 65.813 ns (step:2) - (measurement period time:0.658134429 sec time_interval:658134429) - (invoke count:10000000 tsc_interval:1970640660)
> [   82.751867] time_bench: Type:kmem bulk_quick_reuse Per elem: 45 cycles(tsc) 15.221 ns (step:2) - (measurement period time:0.152212195 sec time_interval:152212195) - (invoke count:10000000 tsc_interval:455766000)
> [   83.047850] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.831 ns (step:3) - (measurement period time:0.278309326 sec time_interval:278309326) - (invoke count:9999999 tsc_interval:833336640)
> [   83.186831] time_bench: Type:kmem bulk_quick_reuse Per elem: 38 cycles(tsc) 12.885 ns (step:3) - (measurement period time:0.128853000 sec time_interval:128853000) - (invoke count:9999999 tsc_interval:385821900)
> [   83.514848] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.901 ns (step:4) - (measurement period time:0.309012550 sec time_interval:309012550) - (invoke count:10000000 tsc_interval:925270980)
> [   83.646835] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.711 ns (step:4) - (measurement period time:0.117116655 sec time_interval:117116655) - (invoke count:10000000 tsc_interval:350679900)
> [   83.954817] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 29.739 ns (step:8) - (measurement period time:0.297398266 sec time_interval:297398266) - (invoke count:10000000 tsc_interval:890494290)
> [   84.069826] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.943 ns (step:8) - (measurement period time:0.099437599 sec time_interval:99437599) - (invoke count:10000000 tsc_interval:297743760)
> [   84.361844] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.263 ns (step:16) - (measurement period time:0.282630878 sec time_interval:282630878) - (invoke count:10000000 tsc_interval:846277020)
> [   84.471816] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.643 ns (step:16) - (measurement period time:0.096439729 sec time_interval:96439729) - (invoke count:10000000 tsc_interval:288767550)
> [   84.977793] time_bench: Type:kmem bulk_fallback Per elem: 145 cycles(tsc) 48.452 ns (step:30) - (measurement period time:0.484520609 sec time_interval:484520609) - (invoke count:9999990 tsc_interval:1450791510)
> [   85.222771] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.726 ns (step:30) - (measurement period time:0.227266268 sec time_interval:227266268) - (invoke count:9999990 tsc_interval:680498580)
> [   85.814766] time_bench: Type:kmem bulk_fallback Per elem: 173 cycles(tsc) 57.907 ns (step:32) - (measurement period time:0.579072933 sec time_interval:579072933) - (invoke count:10000000 tsc_interval:1733908170)
> [   85.914739] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.385 ns (step:32) - (measurement period time:0.093857661 sec time_interval:93857661) - (invoke count:10000000 tsc_interval:281035770)
> [   86.207764] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.489 ns (step:34) - (measurement period time:0.274891966 sec time_interval:274891966) - (invoke count:9999978 tsc_interval:823104480)
> [   86.452755] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 23.040 ns (step:34) - (measurement period time:0.230401610 sec time_interval:230401610) - (invoke count:9999978 tsc_interval:689886630)
> [   86.736743] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.326 ns (step:48) - (measurement period time:0.273267062 sec time_interval:273267062) - (invoke count:9999984 tsc_interval:818238330)
> [   86.839857] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.506 ns (step:48) - (measurement period time:0.095059470 sec time_interval:95059470) - (invoke count:9999984 tsc_interval:284634690)
> [   87.432947] time_bench: Type:kmem bulk_fallback Per elem: 172 cycles(tsc) 57.565 ns (step:64) - (measurement period time:0.575650143 sec time_interval:575650143) - (invoke count:10000000 tsc_interval:1723659720)
> [   87.536682] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.267 ns (step:64) - (measurement period time:0.092674016 sec time_interval:92674016) - (invoke count:10000000 tsc_interval:277491600)
> [   87.829693] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.082 ns (step:128) - (measurement period time:0.280825239 sec time_interval:280825239) - (invoke count:10000000 tsc_interval:840869820)
> [   87.942860] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.387 ns (step:128) - (measurement period time:0.103871104 sec time_interval:103871104) - (invoke count:10000000 tsc_interval:311019150)
> [   88.242686] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.101 ns (step:158) - (measurement period time:0.281012946 sec time_interval:281012946) - (invoke count:9999978 tsc_interval:841431990)
> [   88.354683] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.852 ns (step:158) - (measurement period time:0.098524040 sec time_interval:98524040) - (invoke count:9999978 tsc_interval:295008030)
> [   88.655671] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.946 ns (step:250) - (measurement period time:0.289463793 sec time_interval:289463793) - (invoke count:10000000 tsc_interval:866736720)
> [   88.776655] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.695 ns (step:250) - (measurement period time:0.106953355 sec time_interval:106953355) - (invoke count:10000000 tsc_interval:320247930)
> 
> [  100.068788] time_bench: Type:for_loop Per elem: 4 cycles(tsc) 1.567 ns (step:0) - (measurement period time:0.156710185 sec time_interval:156710185) - (invoke count:100000000 tsc_interval:469233480)
> [  100.654304] time_bench: Type:kmem fastpath reuse Per elem: 170 cycles(tsc) 56.967 ns (step:0) - (measurement period time:0.569671924 sec time_interval:569671924) - (invoke count:10000000 tsc_interval:1705759620)
> [  101.373300] time_bench: Type:kmem bulk_fallback Per elem: 212 cycles(tsc) 70.812 ns (step:1) - (measurement period time:0.708129741 sec time_interval:708129741) - (invoke count:10000000 tsc_interval:2120342250)
> [  101.840283] time_bench: Type:kmem bulk_quick_reuse Per elem: 136 cycles(tsc) 45.527 ns (step:1) - (measurement period time:0.455275848 sec time_interval:455275848) - (invoke count:10000000 tsc_interval:1363225020)
> [  102.139276] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 29.044 ns (step:2) - (measurement period time:0.290446762 sec time_interval:290446762) - (invoke count:10000000 tsc_interval:869680110)
> [  102.303272] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.383 ns (step:2) - (measurement period time:0.153838537 sec time_interval:153838537) - (invoke count:10000000 tsc_interval:460636140)
> [  103.012346] time_bench: Type:kmem bulk_fallback Per elem: 209 cycles(tsc) 69.979 ns (step:3) - (measurement period time:0.699793666 sec time_interval:699793666) - (invoke count:9999999 tsc_interval:2095381860)
> [  103.148352] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.208 ns (step:3) - (measurement period time:0.132082868 sec time_interval:132082868) - (invoke count:9999999 tsc_interval:395493210)
> [  103.462233] time_bench: Type:kmem bulk_fallback Per elem: 91 cycles(tsc) 30.467 ns (step:4) - (measurement period time:0.304675759 sec time_interval:304675759) - (invoke count:10000000 tsc_interval:912285930)
> [  103.761428] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.059 ns (step:4) - (measurement period time:0.290597158 sec time_interval:290597158) - (invoke count:10000000 tsc_interval:870129780)
> [  104.501334] time_bench: Type:kmem bulk_fallback Per elem: 218 cycles(tsc) 73.076 ns (step:8) - (measurement period time:0.730767822 sec time_interval:730767822) - (invoke count:10000000 tsc_interval:2188127310)
> [  104.732329] time_bench: Type:kmem bulk_quick_reuse Per elem: 66 cycles(tsc) 22.280 ns (step:8) - (measurement period time:0.222806934 sec time_interval:222806934) - (invoke count:10000000 tsc_interval:667146780)
> [  105.346195] time_bench: Type:kmem bulk_fallback Per elem: 180 cycles(tsc) 60.308 ns (step:16) - (measurement period time:0.603085855 sec time_interval:603085855) - (invoke count:10000000 tsc_interval:1805810910)
> [  105.565213] time_bench: Type:kmem bulk_quick_reuse Per elem: 62 cycles(tsc) 20.731 ns (step:16) - (measurement period time:0.207317878 sec time_interval:207317878) - (invoke count:10000000 tsc_interval:620768190)
> [  106.154163] time_bench: Type:kmem bulk_fallback Per elem: 173 cycles(tsc) 57.884 ns (step:30) - (measurement period time:0.578841035 sec time_interval:578841035) - (invoke count:9999990 tsc_interval:1733213910)
> [  106.450218] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.455 ns (step:30) - (measurement period time:0.284558769 sec time_interval:284558769) - (invoke count:9999990 tsc_interval:852048780)
> [  107.137140] time_bench: Type:kmem bulk_fallback Per elem: 199 cycles(tsc) 66.729 ns (step:32) - (measurement period time:0.667298185 sec time_interval:667298185) - (invoke count:10000000 tsc_interval:1998081120)
> [  107.244232] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.655 ns (step:32) - (measurement period time:0.096558958 sec time_interval:96558958) - (invoke count:10000000 tsc_interval:289124430)
> [  107.528225] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.584 ns (step:34) - (measurement period time:0.275841028 sec time_interval:275841028) - (invoke count:9999978 tsc_interval:825940800)
> [  107.628207] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.182 ns (step:34) - (measurement period time:0.091822659 sec time_interval:91822659) - (invoke count:9999978 tsc_interval:274942830)
> [  107.913114] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.639 ns (step:48) - (measurement period time:0.276397658 sec time_interval:276397658) - (invoke count:9999984 tsc_interval:827612400)
> [  108.013118] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.281 ns (step:48) - (measurement period time:0.092811773 sec time_interval:92811773) - (invoke count:9999984 tsc_interval:277904550)
> [  108.293222] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.413 ns (step:64) - (measurement period time:0.274134107 sec time_interval:274134107) - (invoke count:10000000 tsc_interval:820835190)
> [  108.394122] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.252 ns (step:64) - (measurement period time:0.092524305 sec time_interval:92524305) - (invoke count:10000000 tsc_interval:277043580)
> [  109.015115] time_bench: Type:kmem bulk_fallback Per elem: 183 cycles(tsc) 61.171 ns (step:128) - (measurement period time:0.611713784 sec time_interval:611713784) - (invoke count:10000000 tsc_interval:1831645590)
> [  109.282175] time_bench: Type:kmem bulk_quick_reuse Per elem: 76 cycles(tsc) 25.538 ns (step:128) - (measurement period time:0.255382498 sec time_interval:255382498) - (invoke count:10000000 tsc_interval:764687130)
> [  109.898178] time_bench: Type:kmem bulk_fallback Per elem: 181 cycles(tsc) 60.732 ns (step:158) - (measurement period time:0.607324486 sec time_interval:607324486) - (invoke count:9999978 tsc_interval:1818501990)
> [  110.111052] time_bench: Type:kmem bulk_quick_reuse Per elem: 60 cycles(tsc) 20.241 ns (step:158) - (measurement period time:0.202414120 sec time_interval:202414120) - (invoke count:9999978 tsc_interval:606085230)
> [  110.715034] time_bench: Type:kmem bulk_fallback Per elem: 178 cycles(tsc) 59.483 ns (step:250) - (measurement period time:0.594833299 sec time_interval:594833299) - (invoke count:10000000 tsc_interval:1781100600)
> [  110.974129] time_bench: Type:kmem bulk_quick_reuse Per elem: 75 cycles(tsc) 25.167 ns (step:250) - (measurement period time:0.251679547 sec time_interval:251679547) - (invoke count:10000000 tsc_interval:753599310)
> 
> [  111.856730] time_bench: Type:for_loop Per elem: 4 cycles(tsc) 1.349 ns (step:0) - (measurement period time:0.134993630 sec time_interval:134993630) - (invoke count:100000000 tsc_interval:404208090)
> [  112.407098] time_bench: Type:kmem fastpath reuse Per elem: 159 cycles(tsc) 53.400 ns (step:0) - (measurement period time:0.534001917 sec time_interval:534001917) - (invoke count:10000000 tsc_interval:1598953680)
> [  113.150981] time_bench: Type:kmem bulk_fallback Per elem: 216 cycles(tsc) 72.396 ns (step:1) - (measurement period time:0.723960939 sec time_interval:723960939) - (invoke count:10000000 tsc_interval:2167744650)
> [  113.381971] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.501 ns (step:1) - (measurement period time:0.225017504 sec time_interval:225017504) - (invoke count:10000000 tsc_interval:673765620)
> [  113.681963] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.967 ns (step:2) - (measurement period time:0.289671345 sec time_interval:289671345) - (invoke count:10000000 tsc_interval:867358230)
> [  113.843955] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.643 ns (step:2) - (measurement period time:0.156437917 sec time_interval:156437917) - (invoke count:10000000 tsc_interval:468418740)
> [  114.140953] time_bench: Type:kmem bulk_fallback Per elem: 85 cycles(tsc) 28.414 ns (step:3) - (measurement period time:0.284148848 sec time_interval:284148848) - (invoke count:9999999 tsc_interval:850821930)
> [  114.279933] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.207 ns (step:3) - (measurement period time:0.132073229 sec time_interval:132073229) - (invoke count:9999999 tsc_interval:395463870)
> [  114.609120] time_bench: Type:kmem bulk_fallback Per elem: 93 cycles(tsc) 31.197 ns (step:4) - (measurement period time:0.311972955 sec time_interval:311972955) - (invoke count:10000000 tsc_interval:934136040)
> [  114.909950] time_bench: Type:kmem bulk_quick_reuse Per elem: 87 cycles(tsc) 29.326 ns (step:4) - (measurement period time:0.293267093 sec time_interval:293267093) - (invoke count:10000000 tsc_interval:878124330)
> [  115.622058] time_bench: Type:kmem bulk_fallback Per elem: 209 cycles(tsc) 70.083 ns (step:8) - (measurement period time:0.700833456 sec time_interval:700833456) - (invoke count:10000000 tsc_interval:2098495740)
> [  115.729918] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.072 ns (step:8) - (measurement period time:0.100729060 sec time_interval:100729060) - (invoke count:10000000 tsc_interval:301610850)
> [  116.445890] time_bench: Type:kmem bulk_fallback Per elem: 211 cycles(tsc) 70.512 ns (step:16) - (measurement period time:0.705126903 sec time_interval:705126903) - (invoke count:10000000 tsc_interval:2111350800)
> [  116.597986] time_bench: Type:kmem bulk_quick_reuse Per elem: 43 cycles(tsc) 14.451 ns (step:16) - (measurement period time:0.144517256 sec time_interval:144517256) - (invoke count:10000000 tsc_interval:432725340)
> [  117.293842] time_bench: Type:kmem bulk_fallback Per elem: 205 cycles(tsc) 68.660 ns (step:30) - (measurement period time:0.686602607 sec time_interval:686602607) - (invoke count:9999990 tsc_interval:2055883860)
> [  117.513834] time_bench: Type:kmem bulk_quick_reuse Per elem: 65 cycles(tsc) 21.724 ns (step:30) - (measurement period time:0.217241306 sec time_interval:217241306) - (invoke count:9999990 tsc_interval:650481120)
> [  118.157816] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.344 ns (step:32) - (measurement period time:0.633443044 sec time_interval:633443044) - (invoke count:10000000 tsc_interval:1896708780)
> [  118.380992] time_bench: Type:kmem bulk_quick_reuse Per elem: 64 cycles(tsc) 21.381 ns (step:32) - (measurement period time:0.213815392 sec time_interval:213815392) - (invoke count:10000000 tsc_interval:640223670)
> [  118.981808] time_bench: Type:kmem bulk_fallback Per elem: 176 cycles(tsc) 58.885 ns (step:34) - (measurement period time:0.588855917 sec time_interval:588855917) - (invoke count:9999978 tsc_interval:1763201640)
> [  119.078787] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.191 ns (step:34) - (measurement period time:0.091919103 sec time_interval:91919103) - (invoke count:9999978 tsc_interval:275231340)
> [  119.368789] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.533 ns (step:48) - (measurement period time:0.275334132 sec time_interval:275334132) - (invoke count:9999984 tsc_interval:824428110)
> [  119.471780] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.519 ns (step:48) - (measurement period time:0.095195091 sec time_interval:95195091) - (invoke count:9999984 tsc_interval:285040080)
> [  119.775775] time_bench: Type:kmem bulk_fallback Per elem: 87 cycles(tsc) 29.149 ns (step:64) - (measurement period time:0.291498274 sec time_interval:291498274) - (invoke count:10000000 tsc_interval:872828640)
> [  119.896771] time_bench: Type:kmem bulk_quick_reuse Per elem: 33 cycles(tsc) 11.330 ns (step:64) - (measurement period time:0.113304207 sec time_interval:113304207) - (invoke count:10000000 tsc_interval:339264000)
> [  120.199773] time_bench: Type:kmem bulk_fallback Per elem: 87 cycles(tsc) 29.289 ns (step:128) - (measurement period time:0.292891157 sec time_interval:292891157) - (invoke count:10000000 tsc_interval:876999360)
> [  120.320757] time_bench: Type:kmem bulk_quick_reuse Per elem: 34 cycles(tsc) 11.476 ns (step:128) - (measurement period time:0.114763286 sec time_interval:114763286) - (invoke count:10000000 tsc_interval:343632900)
> [  120.976762] time_bench: Type:kmem bulk_fallback Per elem: 192 cycles(tsc) 64.320 ns (step:158) - (measurement period time:0.643207519 sec time_interval:643207519) - (invoke count:9999978 tsc_interval:1925946840)
> [  121.231790] time_bench: Type:kmem bulk_quick_reuse Per elem: 73 cycles(tsc) 24.705 ns (step:158) - (measurement period time:0.247055281 sec time_interval:247055281) - (invoke count:9999978 tsc_interval:739752480)
> [  121.875817] time_bench: Type:kmem bulk_fallback Per elem: 189 cycles(tsc) 63.224 ns (step:250) - (measurement period time:0.632244442 sec time_interval:632244442) - (invoke count:10000000 tsc_interval:1893119520)
> [  122.148737] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.410 ns (step:250) - (measurement period time:0.264101742 sec time_interval:264101742) - (invoke count:10000000 tsc_interval:790794030)
> 
> Patched:
> 
> [  654.054203] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.776 ns (step:0) - (measurement period time:0.077664118 sec time_interval:77664118) - (invoke count:100000000 tsc_interval:232545750)
> [  654.592857] time_bench: Type:kmem fastpath reuse Per elem: 161 cycles(tsc) 53.860 ns (step:0) - (measurement period time:0.538607021 sec time_interval:538607021) - (invoke count:10000000 tsc_interval:1612734660)
> [  655.248550] time_bench: Type:kmem bulk_fallback Per elem: 196 cycles(tsc) 65.568 ns (step:1) - (measurement period time:0.655680061 sec time_interval:655680061) - (invoke count:10000000 tsc_interval:1963282620)
> [  655.475563] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.697 ns (step:1) - (measurement period time:0.226975586 sec time_interval:226975586) - (invoke count:10000000 tsc_interval:679625070)
> [  655.757615] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.204 ns (step:2) - (measurement period time:0.282047104 sec time_interval:282047104) - (invoke count:10000000 tsc_interval:844524090)
> [  655.943657] time_bench: Type:kmem bulk_quick_reuse Per elem: 55 cycles(tsc) 18.599 ns (step:2) - (measurement period time:0.185992389 sec time_interval:185992389) - (invoke count:10000000 tsc_interval:556910100)
> [  656.221528] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.783 ns (step:3) - (measurement period time:0.277833288 sec time_interval:277833288) - (invoke count:9999999 tsc_interval:831906840)
> [  656.535062] time_bench: Type:kmem bulk_quick_reuse Per elem: 93 cycles(tsc) 31.351 ns (step:3) - (measurement period time:0.313512217 sec time_interval:313512217) - (invoke count:9999999 tsc_interval:938739120)
> [  656.843267] time_bench: Type:kmem bulk_fallback Per elem: 92 cycles(tsc) 30.818 ns (step:4) - (measurement period time:0.308185034 sec time_interval:308185034) - (invoke count:10000000 tsc_interval:922788240)
> [  656.961808] time_bench: Type:kmem bulk_quick_reuse Per elem: 35 cycles(tsc) 11.850 ns (step:4) - (measurement period time:0.118503561 sec time_interval:118503561) - (invoke count:10000000 tsc_interval:354830820)
> [  657.691366] time_bench: Type:kmem bulk_fallback Per elem: 218 cycles(tsc) 72.954 ns (step:8) - (measurement period time:0.729541418 sec time_interval:729541418) - (invoke count:10000000 tsc_interval:2184443100)
> [  657.792001] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.060 ns (step:8) - (measurement period time:0.100604744 sec time_interval:100604744) - (invoke count:10000000 tsc_interval:301236990)
> [  658.070712] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.868 ns (step:16) - (measurement period time:0.278687720 sec time_interval:278687720) - (invoke count:10000000 tsc_interval:834465960)
> [  658.169621] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.887 ns (step:16) - (measurement period time:0.098871033 sec time_interval:98871033) - (invoke count:10000000 tsc_interval:296045940)
> [  658.846891] time_bench: Type:kmem bulk_fallback Per elem: 202 cycles(tsc) 67.726 ns (step:30) - (measurement period time:0.677260248 sec time_interval:677260248) - (invoke count:9999990 tsc_interval:2027899590)
> [  658.940547] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.354 ns (step:30) - (measurement period time:0.093547668 sec time_interval:93547668) - (invoke count:9999990 tsc_interval:280105560)
> [  659.214131] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.356 ns (step:32) - (measurement period time:0.273564878 sec time_interval:273564878) - (invoke count:10000000 tsc_interval:819126750)
> [  659.307010] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.286 ns (step:32) - (measurement period time:0.092862249 sec time_interval:92862249) - (invoke count:10000000 tsc_interval:278053470)
> [  659.577675] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.065 ns (step:34) - (measurement period time:0.270657877 sec time_interval:270657877) - (invoke count:9999978 tsc_interval:810422340)
> [  659.670155] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.246 ns (step:34) - (measurement period time:0.092468447 sec time_interval:92468447) - (invoke count:9999978 tsc_interval:276874410)
> [  659.941498] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.129 ns (step:48) - (measurement period time:0.271292799 sec time_interval:271292799) - (invoke count:9999984 tsc_interval:812323620)
> [  660.034358] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.284 ns (step:48) - (measurement period time:0.092846689 sec time_interval:92846689) - (invoke count:9999984 tsc_interval:278007390)
> [  660.305652] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.125 ns (step:64) - (measurement period time:0.271257793 sec time_interval:271257793) - (invoke count:10000000 tsc_interval:812218680)
> [  660.535235] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.955 ns (step:64) - (measurement period time:0.229550122 sec time_interval:229550122) - (invoke count:10000000 tsc_interval:687333360)
> [  660.814888] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.964 ns (step:128) - (measurement period time:0.279643666 sec time_interval:279643666) - (invoke count:10000000 tsc_interval:837328200)
> [  660.915969] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.104 ns (step:128) - (measurement period time:0.101047589 sec time_interval:101047589) - (invoke count:10000000 tsc_interval:302562990)
> [  661.275325] time_bench: Type:kmem bulk_fallback Per elem: 107 cycles(tsc) 35.933 ns (step:158) - (measurement period time:0.359338210 sec time_interval:359338210) - (invoke count:9999978 tsc_interval:1075954290)
> [  661.375091] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.975 ns (step:158) - (measurement period time:0.099750172 sec time_interval:99750172) - (invoke count:9999978 tsc_interval:298678200)
> [  661.655844] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.074 ns (step:250) - (measurement period time:0.280746521 sec time_interval:280746521) - (invoke count:10000000 tsc_interval:840630900)
> [  661.762658] time_bench: Type:kmem bulk_quick_reuse Per elem: 31 cycles(tsc) 10.680 ns (step:250) - (measurement period time:0.106802018 sec time_interval:106802018) - (invoke count:10000000 tsc_interval:319793460)
> 
> [  663.188701] time_bench: Type:for_loop Per elem: 2 cycles(tsc) 0.772 ns (step:0) - (measurement period time:0.077219119 sec time_interval:77219119) - (invoke count:100000000 tsc_interval:231214350)
> [  663.723737] time_bench: Type:kmem fastpath reuse Per elem: 160 cycles(tsc) 53.501 ns (step:0) - (measurement period time:0.535016285 sec time_interval:535016285) - (invoke count:10000000 tsc_interval:1601983200)
> [  664.022069] time_bench: Type:kmem bulk_fallback Per elem: 89 cycles(tsc) 29.828 ns (step:1) - (measurement period time:0.298280101 sec time_interval:298280101) - (invoke count:10000000 tsc_interval:893130450)
> [  664.248849] time_bench: Type:kmem bulk_quick_reuse Per elem: 67 cycles(tsc) 22.677 ns (step:1) - (measurement period time:0.226775284 sec time_interval:226775284) - (invoke count:10000000 tsc_interval:679026090)
> [  664.530649] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.179 ns (step:2) - (measurement period time:0.281793671 sec time_interval:281793671) - (invoke count:10000000 tsc_interval:843766020)
> [  664.686627] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.593 ns (step:2) - (measurement period time:0.155939154 sec time_interval:155939154) - (invoke count:10000000 tsc_interval:466923720)
> [  665.370321] time_bench: Type:kmem bulk_fallback Per elem: 204 cycles(tsc) 68.367 ns (step:3) - (measurement period time:0.683678844 sec time_interval:683678844) - (invoke count:9999999 tsc_interval:2047118220)
> [  665.685507] time_bench: Type:kmem bulk_quick_reuse Per elem: 94 cycles(tsc) 31.513 ns (step:3) - (measurement period time:0.315139143 sec time_interval:315139143) - (invoke count:9999999 tsc_interval:943611060)
> [  666.448847] time_bench: Type:kmem bulk_fallback Per elem: 228 cycles(tsc) 76.331 ns (step:4) - (measurement period time:0.763310680 sec time_interval:763310680) - (invoke count:10000000 tsc_interval:2285557860)
> [  666.745314] time_bench: Type:kmem bulk_quick_reuse Per elem: 88 cycles(tsc) 29.643 ns (step:4) - (measurement period time:0.296436791 sec time_interval:296436791) - (invoke count:10000000 tsc_interval:887610960)
> [  667.041829] time_bench: Type:kmem bulk_fallback Per elem: 88 cycles(tsc) 29.650 ns (step:8) - (measurement period time:0.296505592 sec time_interval:296505592) - (invoke count:10000000 tsc_interval:887817120)
> [  667.142484] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.064 ns (step:8) - (measurement period time:0.100642315 sec time_interval:100642315) - (invoke count:10000000 tsc_interval:301350000)
> [  667.420593] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.810 ns (step:16) - (measurement period time:0.278104977 sec time_interval:278104977) - (invoke count:10000000 tsc_interval:832721010)
> [  667.519271] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.866 ns (step:16) - (measurement period time:0.098662815 sec time_interval:98662815) - (invoke count:10000000 tsc_interval:295422450)
> [  667.792475] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.315 ns (step:30) - (measurement period time:0.273152701 sec time_interval:273152701) - (invoke count:9999990 tsc_interval:817892820)
> [  668.023804] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.130 ns (step:30) - (measurement period time:0.231303811 sec time_interval:231303811) - (invoke count:9999990 tsc_interval:692584950)
> [  668.696907] time_bench: Type:kmem bulk_fallback Per elem: 201 cycles(tsc) 67.306 ns (step:32) - (measurement period time:0.673067682 sec time_interval:673067682) - (invoke count:10000000 tsc_interval:2015345790)
> [  668.889019] time_bench: Type:kmem bulk_quick_reuse Per elem: 57 cycles(tsc) 19.208 ns (step:32) - (measurement period time:0.192088279 sec time_interval:192088279) - (invoke count:10000000 tsc_interval:575162820)
> [  669.342870] time_bench: Type:kmem bulk_fallback Per elem: 135 cycles(tsc) 45.383 ns (step:34) - (measurement period time:0.453831353 sec time_interval:453831353) - (invoke count:9999978 tsc_interval:1358892420)
> [  669.436107] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.322 ns (step:34) - (measurement period time:0.093220843 sec time_interval:93220843) - (invoke count:9999978 tsc_interval:279126840)
> [  669.707772] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.165 ns (step:48) - (measurement period time:0.271654970 sec time_interval:271654970) - (invoke count:9999984 tsc_interval:813407310)
> [  669.800509] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.268 ns (step:48) - (measurement period time:0.092683978 sec time_interval:92683978) - (invoke count:9999984 tsc_interval:277520190)
> [  670.068757] time_bench: Type:kmem bulk_fallback Per elem: 80 cycles(tsc) 26.823 ns (step:64) - (measurement period time:0.268231313 sec time_interval:268231313) - (invoke count:10000000 tsc_interval:803156580)
> [  670.297078] time_bench: Type:kmem bulk_quick_reuse Per elem: 68 cycles(tsc) 22.829 ns (step:64) - (measurement period time:0.228295958 sec time_interval:228295958) - (invoke count:10000000 tsc_interval:683578080)
> [  670.573819] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.673 ns (step:128) - (measurement period time:0.276731254 sec time_interval:276731254) - (invoke count:10000000 tsc_interval:828607050)
> [  670.676864] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.300 ns (step:128) - (measurement period time:0.103002111 sec time_interval:103002111) - (invoke count:10000000 tsc_interval:308415540)
> [  671.318177] time_bench: Type:kmem bulk_fallback Per elem: 192 cycles(tsc) 64.130 ns (step:158) - (measurement period time:0.641303389 sec time_interval:641303389) - (invoke count:9999978 tsc_interval:1920234600)
> [  671.417083] time_bench: Type:kmem bulk_quick_reuse Per elem: 29 cycles(tsc) 9.889 ns (step:158) - (measurement period time:0.098890269 sec time_interval:98890269) - (invoke count:9999978 tsc_interval:296103210)
> [  671.700461] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.334 ns (step:250) - (measurement period time:0.283346426 sec time_interval:283346426) - (invoke count:10000000 tsc_interval:848415660)
> [  671.965515] time_bench: Type:kmem bulk_quick_reuse Per elem: 79 cycles(tsc) 26.502 ns (step:250) - (measurement period time:0.265021064 sec time_interval:265021064) - (invoke count:10000000 tsc_interval:793543500)
> 
> [  686.749446] time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.660 ns (step:0) - (measurement period time:0.066028480 sec time_interval:66028480) - (invoke count:100000000 tsc_interval:197707140)
> [  687.296902] time_bench: Type:kmem fastpath reuse Per elem: 163 cycles(tsc) 54.742 ns (step:0) - (measurement period time:0.547423736 sec time_interval:547423736) - (invoke count:10000000 tsc_interval:1639141260)
> [  687.910620] time_bench: Type:kmem bulk_fallback Per elem: 183 cycles(tsc) 61.369 ns (step:1) - (measurement period time:0.613692564 sec time_interval:613692564) - (invoke count:10000000 tsc_interval:1837568160)
> [  688.381090] time_bench: Type:kmem bulk_quick_reuse Per elem: 140 cycles(tsc) 47.045 ns (step:1) - (measurement period time:0.470452576 sec time_interval:470452576) - (invoke count:10000000 tsc_interval:1408667550)
> [  688.662045] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.094 ns (step:2) - (measurement period time:0.280943997 sec time_interval:280943997) - (invoke count:10000000 tsc_interval:841225230)
> [  688.817464] time_bench: Type:kmem bulk_quick_reuse Per elem: 46 cycles(tsc) 15.540 ns (step:2) - (measurement period time:0.155409002 sec time_interval:155409002) - (invoke count:10000000 tsc_interval:465337980)
> [  689.094749] time_bench: Type:kmem bulk_fallback Per elem: 83 cycles(tsc) 27.723 ns (step:3) - (measurement period time:0.277235751 sec time_interval:277235751) - (invoke count:9999999 tsc_interval:830122170)
> [  689.225706] time_bench: Type:kmem bulk_quick_reuse Per elem: 39 cycles(tsc) 13.091 ns (step:3) - (measurement period time:0.130919113 sec time_interval:130919113) - (invoke count:9999999 tsc_interval:392008440)
> [  689.988861] time_bench: Type:kmem bulk_fallback Per elem: 228 cycles(tsc) 76.314 ns (step:4) - (measurement period time:0.763146670 sec time_interval:763146670) - (invoke count:10000000 tsc_interval:2285076870)
> [  690.274210] time_bench: Type:kmem bulk_quick_reuse Per elem: 85 cycles(tsc) 28.532 ns (step:4) - (measurement period time:0.285320525 sec time_interval:285320525) - (invoke count:10000000 tsc_interval:854329500)
> [  690.862234] time_bench: Type:kmem bulk_fallback Per elem: 176 cycles(tsc) 58.799 ns (step:8) - (measurement period time:0.587998540 sec time_interval:587998540) - (invoke count:10000000 tsc_interval:1760633010)
> [  690.964020] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.171 ns (step:8) - (measurement period time:0.101718599 sec time_interval:101718599) - (invoke count:10000000 tsc_interval:304573500)
> [  691.245251] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.122 ns (step:16) - (measurement period time:0.281223369 sec time_interval:281223369) - (invoke count:10000000 tsc_interval:842060850)
> [  691.341256] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.599 ns (step:16) - (measurement period time:0.095990014 sec time_interval:95990014) - (invoke count:10000000 tsc_interval:287420100)
> [  691.616379] time_bench: Type:kmem bulk_fallback Per elem: 82 cycles(tsc) 27.511 ns (step:30) - (measurement period time:0.275116534 sec time_interval:275116534) - (invoke count:9999990 tsc_interval:823776390)
> [  691.710275] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.388 ns (step:30) - (measurement period time:0.093884613 sec time_interval:93884613) - (invoke count:9999990 tsc_interval:281115990)
> [  691.982082] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.180 ns (step:32) - (measurement period time:0.271800767 sec time_interval:271800767) - (invoke count:10000000 tsc_interval:813847530)
> [  692.077384] time_bench: Type:kmem bulk_quick_reuse Per elem: 28 cycles(tsc) 9.526 ns (step:32) - (measurement period time:0.095266005 sec time_interval:95266005) - (invoke count:10000000 tsc_interval:285252780)
> [  692.348422] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.102 ns (step:34) - (measurement period time:0.271026511 sec time_interval:271026511) - (invoke count:9999978 tsc_interval:811529490)
> [  692.440805] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.236 ns (step:34) - (measurement period time:0.092368535 sec time_interval:92368535) - (invoke count:9999978 tsc_interval:276576810)
> [  692.712439] time_bench: Type:kmem bulk_fallback Per elem: 81 cycles(tsc) 27.162 ns (step:48) - (measurement period time:0.271619761 sec time_interval:271619761) - (invoke count:9999984 tsc_interval:813305970)
> [  692.945558] time_bench: Type:kmem bulk_quick_reuse Per elem: 69 cycles(tsc) 23.309 ns (step:48) - (measurement period time:0.233091977 sec time_interval:233091977) - (invoke count:9999984 tsc_interval:697942470)
> [  693.234591] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.902 ns (step:64) - (measurement period time:0.289021416 sec time_interval:289021416) - (invoke count:10000000 tsc_interval:865411350)
> [  693.326142] time_bench: Type:kmem bulk_quick_reuse Per elem: 27 cycles(tsc) 9.153 ns (step:64) - (measurement period time:0.091539475 sec time_interval:91539475) - (invoke count:10000000 tsc_interval:274094220)
> [  693.615858] time_bench: Type:kmem bulk_fallback Per elem: 86 cycles(tsc) 28.970 ns (step:128) - (measurement period time:0.289709207 sec time_interval:289709207) - (invoke count:10000000 tsc_interval:867470400)
> [  693.717321] time_bench: Type:kmem bulk_quick_reuse Per elem: 30 cycles(tsc) 10.145 ns (step:128) - (measurement period time:0.101451019 sec time_interval:101451019) - (invoke count:10000000 tsc_interval:303772410)
> [  694.000375] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.304 ns (step:158) - (measurement period time:0.283047625 sec time_interval:283047625) - (invoke count:9999978 tsc_interval:847523850)
> [  694.108588] time_bench: Type:kmem bulk_quick_reuse Per elem: 32 cycles(tsc) 10.816 ns (step:158) - (measurement period time:0.108168257 sec time_interval:108168257) - (invoke count:9999978 tsc_interval:323885820)
> [  694.392070] time_bench: Type:kmem bulk_fallback Per elem: 84 cycles(tsc) 28.344 ns (step:250) - (measurement period time:0.283447055 sec time_interval:283447055) - (invoke count:10000000 tsc_interval:848719800)
> [  694.655226] time_bench: Type:kmem bulk_quick_reuse Per elem: 78 cycles(tsc) 26.312 ns (step:250) - (measurement period time:0.263123465 sec time_interval:263123465) - (invoke count:10000000 tsc_interval:787864230)
> 
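
The per-element figures in the time_bench lines above can be rechecked
from the raw counters printed on each line: cycles is tsc_interval
divided by invoke count, and ns is time_interval divided by invoke count.
A minimal Python sketch, assuming the log format shown here. The function
name parse_time_bench and the dictionary keys are made up for illustration
and are not part of the time_bench module; truncating instead of rounding
is inferred from the printed values, not from the benchmark source.

```python
import re

# Matches one time_bench line as it appears in the quoted dmesg output.
# The pattern is an assumption based on the lines in this thread only.
_PATTERN = re.compile(
    r"time_bench: Type:(?P<type>.+?) Per elem: (?P<cycles>\d+) cycles\(tsc\) "
    r"(?P<ns>\d+\.\d+) ns \(step:(?P<step>\d+)\)"
    r".*?time_interval:(?P<time_interval>\d+)\)"
    r".*?invoke count:(?P<count>\d+) tsc_interval:(?P<tsc_interval>\d+)\)"
)


def parse_time_bench(line):
    """Parse one log line; return None if it is not a time_bench result."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    count = int(m.group("count"))
    tsc_interval = int(m.group("tsc_interval"))
    time_interval_ns = int(m.group("time_interval"))
    return {
        "type": m.group("type"),
        "step": int(m.group("step")),
        "cycles_printed": int(m.group("cycles")),
        "ns_printed": float(m.group("ns")),
        # Integer division reproduces the truncated cycle count.
        "cycles_recomputed": tsc_interval // count,
        # time_interval is in nanoseconds, so this is ns per element.
        "ns_recomputed": time_interval_ns / count,
    }
```

Feeding each quoted line through parse_time_bench() and grouping the
results by (type, step) makes it easier to compare the baseline and
patched runs, since single runs vary a lot between iterations.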




* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (18 preceding siblings ...)
  2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
@ 2020-06-18  9:27 ` Mike Rapoport
  2020-06-18 20:43   ` Roman Gushchin
  2020-06-21 22:57 ` Qian Cai
  20 siblings, 1 reply; 92+ messages in thread
From: Mike Rapoport @ 2020-06-18  9:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel

Hi Roman,

On Mon, Jun 08, 2020 at 04:06:35PM -0700, Roman Gushchin wrote:
> This is v6 of the slab cgroup controller rework.
> 
> The patchset moves the accounting from the page level to the object
> level. It allows to share slab pages between memory cgroups.
> This leads to a significant win in the slab utilization (up to 45%)
> and the corresponding drop in the total kernel memory footprint.
> The reduced number of unmovable slab pages should also have a positive
> effect on the memory fragmentation.
 
... 
 
> Johannes Weiner (1):
>   mm: memcontrol: decouple reference counting from page accounting
> 
> Roman Gushchin (18):
>   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
>   mm: memcg: prepare for byte-sized vmstat items
>   mm: memcg: convert vmstat slab counters to bytes
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcg/slab: obj_cgroup API
>   mm: memcg/slab: allocate obj_cgroups for non-root slab pages
>   mm: memcg/slab: save obj_cgroup for non-root slab objects
>   mm: memcg/slab: charge individual slab objects instead of pages
>   mm: memcg/slab: deprecate memory.kmem.slabinfo
>   mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
>   mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
>   mm: memcg/slab: simplify memcg cache creation
>   mm: memcg/slab: remove memcg_kmem_get_cache()
>   mm: memcg/slab: deprecate slab_root_caches
>   mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
>   mm: memcg/slab: use a single set of kmem_caches for all allocations
>   kselftests: cgroup: add kernel memory accounting tests
>   tools/cgroup: add memcg_slabinfo.py tool
 
Sorry for jumping in late, but I'm really missing

   Documentation/vm/cgroup-slab.rst	      | < lots of + >

in this series ;-)

>  drivers/base/node.c                        |   6 +-
>  fs/proc/meminfo.c                          |   4 +-
>  include/linux/memcontrol.h                 |  85 ++-
>  include/linux/mm_types.h                   |   5 +-
>  include/linux/mmzone.h                     |  24 +-
>  include/linux/slab.h                       |   5 -
>  include/linux/slab_def.h                   |   9 +-
>  include/linux/slub_def.h                   |  31 +-
>  include/linux/vmstat.h                     |  14 +-
>  kernel/power/snapshot.c                    |   2 +-
>  mm/memcontrol.c                            | 608 +++++++++++--------
>  mm/oom_kill.c                              |   2 +-
>  mm/page_alloc.c                            |   8 +-
>  mm/slab.c                                  |  70 +--
>  mm/slab.h                                  | 372 +++++-------
>  mm/slab_common.c                           | 643 +--------------------
>  mm/slob.c                                  |  12 +-
>  mm/slub.c                                  | 229 +-------
>  mm/vmscan.c                                |   3 +-
>  mm/vmstat.c                                |  30 +-
>  mm/workingset.c                            |   6 +-
>  tools/cgroup/memcg_slabinfo.py             | 226 ++++++++
>  tools/testing/selftests/cgroup/.gitignore  |   1 +
>  tools/testing/selftests/cgroup/Makefile    |   2 +
>  tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
>  25 files changed, 1374 insertions(+), 1405 deletions(-)
>  create mode 100755 tools/cgroup/memcg_slabinfo.py
>  create mode 100644 tools/testing/selftests/cgroup/test_kmem.c
> 
> -- 
> 2.25.4
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-18  8:43             ` Jesper Dangaard Brouer
@ 2020-06-18  9:31               ` Jesper Dangaard Brouer
  2020-06-19  1:30                 ` Roman Gushchin
  2020-06-19  1:27               ` Roman Gushchin
  1 sibling, 1 reply; 92+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-18  9:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman, brouer

On Thu, 18 Jun 2020 10:43:44 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Wed, 17 Jun 2020 18:29:28 -0700
> Roman Gushchin <guro@fb.com> wrote:
> 
> > On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:  
> > > On 6/17/20 5:32 AM, Roman Gushchin wrote:    
> > > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:    
> > > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:    
> > > >> >
> > > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:    
> > > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:    
> > > >> > > >    
> > > >> [...]    
> > > >> > >
> > > >> > > Have you performed any [perf] testing on SLAB with this patchset?    
> > > >> >
> > > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > > >> > significant difference. I've checked that it compiles, boots and passes
> > > >> > kselftests. And that memory savings are there.
> > > >> >    
> > > >> 
> > > >> What about performance? Also you mentioned that sharing kmem-cache
> > > >> between accounted and non-accounted can have additional overhead. Any
> > > >> difference between SLAB and SLUB for such a case?    
> > > > 
> > > > Not really.
> > > > 
> > > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > > allocations, which is something I've tried hard to avoid in my original version.
> > > > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > > it's hard to argue with Johanness, who pushed on this design.
> > > > 
> > > > With performance testing it's not that easy, because it's not obvious what
> > > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > > measuring something like 1000000 allocations and deallocations in a line from
> > > > a single kmem_cache will show a regression. But in the real world the relative
> > > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > > working set and from having shared kmem_cache objects cache hot.
> > > > Not speaking about some extra memory and the fragmentation reduction.
> > > > 
> > > > We've done an extensive testing of the original version in Facebook production,
> > > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > > using an original version with two sets of kmem_caches.
> > > > 
> > > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > > can help with the performance evaluation, I'll appreciate it a lot.    
> > > 
> > > Jesper provided some pointers here [1], it would be really great if you could
> > > run at least those microbenchmarks. With mmtests it's the major question of
> > > which subset/profiles to run, maybe the referenced commits provide some hints,
> > > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > > 
> > > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/    
> > 
> > Oh, Jesper, I'm really sorry, somehow I missed your mail.
> > Thank you, Vlastimil, for pointing at it.
> > 
> > I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> > 
> > I ran original vs patched with SLUB and SLAB, each test several times and picked
> > 3 which looked most consistently. But it still looks very noisy.
> > 
> > I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> > it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).  
> 
> What about running these tests on the server level hardware, that you
> intent to run this on?  

To give you an idea of the performance difference I ran the same test
on a Broadwell Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz.

The SLUB fastpath:
 Type:kmem fastpath reuse Per elem: 60 cycles(tsc) 16.822 ns
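
As a rough cross-check of how the two columns relate, here is a minimal
sketch. It assumes the TSC ticks at the nominal 3.60 GHz of this CPU, and
the helper name is made up for illustration. Under that assumption 60
cycles come to roughly 16.7 ns, close to the 16.822 ns reported above
(the cycle count is rounded, so a small mismatch is expected).

```c
#include <assert.h>

/*
 * Convert a TSC cycle count to nanoseconds, assuming a constant TSC
 * frequency given in GHz (i.e. cycles per nanosecond).
 * Hypothetical helper, not part of the benchmark module.
 */
double cycles_to_ns(double cycles, double tsc_ghz)
{
	assert(tsc_ghz > 0.0);
	return cycles / tsc_ghz;
}
```

For example, cycles_to_ns(60.0, 3.6) is about 16.67.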


> > 
> > How should I interpret this data?  
> 
> First of all these SLUB+SLAB microbenchmarks use object size 256 bytes,
> because network stack alloc object of this size for SKBs/sk_buff (due
> to cache-align as used size is 224 bytes). Checked SLUB: Each slab use
> 2 pages (8192 bytes) and contain 32 object of size 256 (256*32=8192).
> 
>   The SLUB allocator have a per-CPU slab which speedup fast-reuse, in this
> case up-to 32 objects. For SLUB the "fastpath reuse" test this behaviour,
> and it serves as a baseline for optimal 1-object performance (where my bulk
> API tries to beat that, which is possible even for 1-object due to knowing
> bulk API cannot be used from IRQ context).
> 
> SLUB fastpath: 3 measurements reporting cycles(tsc)
>  - SLUB-patched : fastpath reuse: 184 - 177 - 176  cycles(tsc)
>  - SLUB-original: fastpath reuse: 178 - 153 - 156  cycles(tsc)
> 

For your SLAB results:

 SLAB fastpath: 3 measurements reporting cycles(tsc)
  - SLAB-patched : 161 - 160 - 163  cycles(tsc)
  - SLAB-original: 174 - 170 - 159  cycles(tsc)

I find it strange that SLAB is slightly better than SLUB (in many
measurements), because SLUB should have an advantage on this fast-path
quick reuse due to the per-CPU slabs.  Maybe this is also related to
the CPU arch you are using?


> There are some stability concerns as you mention, but it seems pretty
> consistently that patched version is slower. If you compile with
> no-PREEMPT you can likely get more stable results (and remove a slight
> overhead for SLUB fastpath).
> 
> The microbenchmark also measures the bulk-API, which is AFAIK only used
> by network stack (and io_uring). I guess you shouldn't focus too much
> on these bulk measurements. When bulk-API cross this objects per slab
> threshold, or is unlucky is it use two per-CPU slab, then the
> measurements can fluctuate a bit.
> 
> Your numbers for SLUB bulk-API:
> 
> SLUB-patched - bulk-API
>  - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
> 
> SLUB-original -  bulk-API
>  - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)

Your numbers for SLAB bulk-API:

SLAB-patched -  bulk-API
 - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
 - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
 - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
 - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
 - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)

SLAB-original-  bulk-API
 - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
 - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
 - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
 - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
 - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)

In case of SLAB I expect the bulk-API to be slightly faster than SLUB,
as the SLUB bulk code is much more advanced.



> Maybe it is just noise or instability in measurements, but it seem that the
> 1-object case is consistently slower in your patched version.
> 
> Mail is too long now... I'll take a look at your SLAB results and followup.

(This is my follow-up with the SLAB results.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
  2020-06-18  0:47   ` Shakeel Butt
@ 2020-06-18 14:55   ` Shakeel Butt
  2020-06-18 19:51     ` Roman Gushchin
  2020-06-19  1:08     ` Roman Gushchin
  2020-06-19  1:31   ` Shakeel Butt
  2 siblings, 2 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-18 14:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

Not sure if my email went through, so, re-sending.

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> From: Johannes Weiner <hannes@cmpxchg.org>
>
[...]
> @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> +       struct mem_cgroup *memcg = head->mem_cgroup;
>         int i;
>
>         if (mem_cgroup_disabled())
>                 return;
>

A memcg NULL check is needed here.

> -       for (i = 1; i < HPAGE_PMD_NR; i++)
> -               head[i].mem_cgroup = head->mem_cgroup;
> +       for (i = 1; i < HPAGE_PMD_NR; i++) {
> +               css_get(&memcg->css);
> +               head[i].mem_cgroup = memcg;
> +       }
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-18 14:55   ` Shakeel Butt
@ 2020-06-18 19:51     ` Roman Gushchin
  2020-06-19  1:08     ` Roman Gushchin
  1 sibling, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18 19:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Thu, Jun 18, 2020 at 07:55:35AM -0700, Shakeel Butt wrote:
> Not sure if my email went through, so, re-sending.

No, I've got it, I was just busy with other stuff.

> 
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > From: Johannes Weiner <hannes@cmpxchg.org>
> >
> [...]
> > @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> >   */
> >  void mem_cgroup_split_huge_fixup(struct page *head)
> >  {
> > +       struct mem_cgroup *memcg = head->mem_cgroup;
> >         int i;
> >
> >         if (mem_cgroup_disabled())
> >                 return;
> >
> 
> A memcg NULL check is needed here.

Thanks for the heads up!

I'll double check it and send a follow-up fix.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-18  7:33       ` Vlastimil Babka
@ 2020-06-18 19:54         ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18 19:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, kernel-team, linux-kernel, Kees Cook

On Thu, Jun 18, 2020 at 09:33:08AM +0200, Vlastimil Babka wrote:
> On 6/18/20 2:35 AM, Roman Gushchin wrote:
> > On Wed, Jun 17, 2020 at 04:35:28PM -0700, Andrew Morton wrote:
> >> On Mon, 8 Jun 2020 16:06:52 -0700 Roman Gushchin <guro@fb.com> wrote:
> >> 
> >> > Instead of having two sets of kmem_caches: one for system-wide and
> >> > non-accounted allocations and the second one shared by all accounted
> >> > allocations, we can use just one.
> >> > 
> >> > The idea is simple: space for obj_cgroup metadata can be allocated
> >> > on demand and filled only for accounted allocations.
> >> > 
> >> > It allows to remove a bunch of code which is required to handle
> >> > kmem_cache clones for accounted allocations. There is no more need
> >> > to create them, accumulate statistics, propagate attributes, etc.
> >> > It's a quite significant simplification.
> >> > 
> >> > Also, because the total number of slab_caches is reduced almost twice
> >> > (not all kmem_caches have a memcg clone), some additional memory
> >> > savings are expected. On my devvm it additionally saves about 3.5%
> >> > of slab memory.
> >> > 
> >> 
> >> This ran afoul of Vlastimil's "mm, slab/slub: move and improve
> >> cache_from_obj()"
> >> (http://lkml.kernel.org/r/20200610163135.17364-10-vbabka@suse.cz).  I
> >> resolved things as below.  Not too sure about slab.c's
> >> cache_from_obj()...
> > 
> > It can actually be as simple as:
> > static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
> > {
> > 	return s;
> > }
> > 
> > But I wonder if we need it at all, or maybe we wanna rename it to
> > something like obj_check_kmem_cache(void *obj, struct kmem_cache *s),
> > because it has now only debug purposes.
> > 
> > Let me and Vlastimil figure it out and send a follow-up patch.
> > Your version is definitely correct.
> 
> Well, Kees wants to restore the common version of cache_from_obj() [1] for SLAB
> hardening.
> 
> To prevent all that back and forth churn entering git history, I think the best
> is for me to send a -fix to my patch that is functionally same while keeping the
> common function, and then this your patch should only have a minor conflict and
> Kees can rebase his patches on top to become much smaller?

Sounds good to me!

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-18  9:27 ` Mike Rapoport
@ 2020-06-18 20:43   ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-18 20:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel

On Thu, Jun 18, 2020 at 12:27:07PM +0300, Mike Rapoport wrote:
> Hi Roman,
> 
> On Mon, Jun 08, 2020 at 04:06:35PM -0700, Roman Gushchin wrote:
> > This is v6 of the slab cgroup controller rework.
> > 
> > The patchset moves the accounting from the page level to the object
> > level. It allows to share slab pages between memory cgroups.
> > This leads to a significant win in the slab utilization (up to 45%)
> > and the corresponding drop in the total kernel memory footprint.
> > The reduced number of unmovable slab pages should also have a positive
> > effect on the memory fragmentation.
>  
> ... 
>  
> > Johannes Weiner (1):
> >   mm: memcontrol: decouple reference counting from page accounting
> > 
> > Roman Gushchin (18):
> >   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
> >   mm: memcg: prepare for byte-sized vmstat items
> >   mm: memcg: convert vmstat slab counters to bytes
> >   mm: slub: implement SLUB version of obj_to_index()
> >   mm: memcg/slab: obj_cgroup API
> >   mm: memcg/slab: allocate obj_cgroups for non-root slab pages
> >   mm: memcg/slab: save obj_cgroup for non-root slab objects
> >   mm: memcg/slab: charge individual slab objects instead of pages
> >   mm: memcg/slab: deprecate memory.kmem.slabinfo
> >   mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
> >   mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
> >   mm: memcg/slab: simplify memcg cache creation
> >   mm: memcg/slab: remove memcg_kmem_get_cache()
> >   mm: memcg/slab: deprecate slab_root_caches
> >   mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
> >   mm: memcg/slab: use a single set of kmem_caches for all allocations
> >   kselftests: cgroup: add kernel memory accounting tests
> >   tools/cgroup: add memcg_slabinfo.py tool
>  
> Sorry for jumping late, but I'm really missing 
> 
>    Documentation/vm/cgroup-slab.rst	      | < lots of + >
> 
> in this series ;-)

Hi Mike!

That's a good point. I'll write something and send a separate patch.
Changes are barely visible to a user, so no rush here,
but it's definitely a good idea to document the new design.

Thank you!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-18 14:55   ` Shakeel Butt
  2020-06-18 19:51     ` Roman Gushchin
@ 2020-06-19  1:08     ` Roman Gushchin
  2020-06-19  1:18       ` Shakeel Butt
  1 sibling, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19  1:08 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Thu, Jun 18, 2020 at 07:55:35AM -0700, Shakeel Butt wrote:
> Not sure if my email went through, so, re-sending.
> 
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > From: Johannes Weiner <hannes@cmpxchg.org>
> >
> [...]
> > @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> >   */
> >  void mem_cgroup_split_huge_fixup(struct page *head)
> >  {
> > +       struct mem_cgroup *memcg = head->mem_cgroup;
> >         int i;
> >
> >         if (mem_cgroup_disabled())
> >                 return;
> >
> 
> A memcg NULL check is needed here.

Hm, it seems like the only way it can be NULL is if mem_cgroup_disabled() is true:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
	unsigned int nr_pages = hpage_nr_pages(page);
	struct mem_cgroup *memcg = NULL;
	int ret = 0;

	if (mem_cgroup_disabled())
		goto out;

	<...>

	if (!memcg)
		memcg = get_mem_cgroup_from_mm(mm);

	ret = try_charge(memcg, gfp_mask, nr_pages);
	if (ret)
		goto out_put;

	css_get(&memcg->css);
	commit_charge(page, memcg);


Did you hit this issue in reality? The only possible scenario I can imagine
is if the page was allocated before enabling memory cgroups.

Are you talking about this case?

Otherwise we put root_mem_cgroup there.

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-19  1:08     ` Roman Gushchin
@ 2020-06-19  1:18       ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-19  1:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Thu, Jun 18, 2020 at 6:08 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jun 18, 2020 at 07:55:35AM -0700, Shakeel Butt wrote:
> > Not sure if my email went through, so, re-sending.
> >
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > [...]
> > > @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> > >   */
> > >  void mem_cgroup_split_huge_fixup(struct page *head)
> > >  {
> > > +       struct mem_cgroup *memcg = head->mem_cgroup;
> > >         int i;
> > >
> > >         if (mem_cgroup_disabled())
> > >                 return;
> > >
> >
> > A memcg NULL check is needed here.
>
> Hm, it seems like the only way how it can be NULL is if mem_cgroup_disabled() is true:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
>         unsigned int nr_pages = hpage_nr_pages(page);
>         struct mem_cgroup *memcg = NULL;
>         int ret = 0;
>
>         if (mem_cgroup_disabled())
>                 goto out;
>
>         <...>
>
>         if (!memcg)
>                 memcg = get_mem_cgroup_from_mm(mm);
>
>         ret = try_charge(memcg, gfp_mask, nr_pages);
>         if (ret)
>                 goto out_put;
>
>         css_get(&memcg->css);
>         commit_charge(page, memcg);
>
>
> Did you hit this issue in reality? The only possible scenario I can imagine
> is if the page was allocated before enabling memory cgroups.
>
> Are you about this case?
>
> Otherwise we put root_mem_cgroup there.
>

Oh yes, you are right. I am confusing this with kmem pages for root
memcg where we don't set the page->mem_cgroup and this patch series
should be changing that.
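
For the record, a userspace model of what the hunk does. This is only a
sketch: the structures are simplified stand-ins rather than the kernel
definitions, memcg_disabled models mem_cgroup_disabled(), and
HPAGE_PMD_NR is assumed to be 512 (4K base pages, 2M huge pages). Every
tail page gets its own css reference, and memcg is dereferenced
unconditionally, so it has to be non-NULL whenever memcg is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_NR 512	/* assumption: 2M huge page / 4K base page */

/* Simplified stand-ins for the kernel structures. */
struct css { long refcnt; };
struct mem_cgroup { struct css css; };
struct page { struct mem_cgroup *mem_cgroup; };

static bool memcg_disabled;	/* models mem_cgroup_disabled() */

static void css_get(struct css *css)
{
	css->refcnt++;
}

/* Mirrors the patched loop: one reference per tail page. */
void split_huge_fixup(struct page *head)
{
	struct mem_cgroup *memcg = head->mem_cgroup;
	int i;

	if (memcg_disabled)
		return;

	/* Expected to hold for any charged page when memcg is enabled. */
	assert(memcg != NULL);

	for (i = 1; i < HPAGE_PMD_NR; i++) {
		css_get(&memcg->css);
		head[i].mem_cgroup = memcg;
	}
}
```

After the call the memcg holds HPAGE_PMD_NR - 1 additional references,
one for each tail page.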

Shakeel

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-18  8:43             ` Jesper Dangaard Brouer
  2020-06-18  9:31               ` Jesper Dangaard Brouer
@ 2020-06-19  1:27               ` Roman Gushchin
  2020-06-19  9:39                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19  1:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman

On Thu, Jun 18, 2020 at 10:43:44AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 17 Jun 2020 18:29:28 -0700
> Roman Gushchin <guro@fb.com> wrote:
> 
> > On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> > > On 6/17/20 5:32 AM, Roman Gushchin wrote:  
> > > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:  
> > > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:  
> > > >> >
> > > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:  
> > > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:  
> > > >> > > >  
> > > >> [...]  
> > > >> > >
> > > >> > > Have you performed any [perf] testing on SLAB with this patchset?  
> > > >> >
> > > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > > >> > significant difference. I've checked that it compiles, boots and passes
> > > >> > kselftests. And that memory savings are there.
> > > >> >  
> > > >> 
> > > >> What about performance? Also you mentioned that sharing kmem-cache
> > > >> between accounted and non-accounted can have additional overhead. Any
> > > >> difference between SLAB and SLUB for such a case?  
> > > > 
> > > > Not really.
> > > > 
> > > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > > allocations, which is something I've tried hard to avoid in my original version.
> > > > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > > it's hard to argue with Johanness, who pushed on this design.
> > > > 
> > > > With performance testing it's not that easy, because it's not obvious what
> > > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > > measuring something like 1000000 allocations and deallocations in a line from
> > > > a single kmem_cache will show a regression. But in the real world the relative
> > > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > > working set and from having shared kmem_cache objects cache hot.
> > > > Not speaking about some extra memory and the fragmentation reduction.
> > > > 
> > > > We've done an extensive testing of the original version in Facebook production,
> > > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > > using an original version with two sets of kmem_caches.
> > > > 
> > > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > > can help with the performance evaluation, I'll appreciate it a lot.  
> > > 
> > > Jesper provided some pointers here [1], it would be really great if you could
> > > run at least those microbenchmarks. With mmtests it's the major question of
> > > which subset/profiles to run, maybe the referenced commits provide some hints,
> > > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > > 
> > > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/  
> > 
> > Oh, Jesper, I'm really sorry, somehow I missed your mail.
> > Thank you, Vlastimil, for pointing at it.
> > 
> > I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> > 
> > I ran original vs patched with SLUB and SLAB, each test several times and picked
> > 3 which looked most consistently. But it still looks very noisy.
> > 
> > I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> > it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).
> 
> What about running these tests on the server level hardware, that you
> intent to run this on?

I'm going to backport this version to the kernel version we're using internally
and will come up with more numbers soon.

> 
> > 
> > How should I interpret this data?
> 
> First of all these SLUB+SLAB microbenchmarks use object size 256 bytes,
> because network stack alloc object of this size for SKBs/sk_buff (due
> to cache-align as used size is 224 bytes). Checked SLUB: Each slab use
> 2 pages (8192 bytes) and contain 32 object of size 256 (256*32=8192).
> 
>   The SLUB allocator have a per-CPU slab which speedup fast-reuse, in this
> case up-to 32 objects. For SLUB the "fastpath reuse" test this behaviour,
> and it serves as a baseline for optimal 1-object performance (where my bulk
> API tries to beat that, which is possible even for 1-object due to knowing
> bulk API cannot be used from IRQ context).
> 
> SLUB fastpath: 3 measurements reporting cycles(tsc)
>  - SLUB-patched : fastpath reuse: 184 - 177 - 176  cycles(tsc)
>  - SLUB-original: fastpath reuse: 178 - 153 - 156  cycles(tsc)
> 
> There are some stability concerns as you mention, but it seems pretty
> consistently that patched version is slower. If you compile with
> no-PREEMPT you can likely get more stable results (and remove a slight
> overhead for SLUB fastpath).
> 
> The microbenchmark also measures the bulk-API, which is AFAIK only used
> by network stack (and io_uring). I guess you shouldn't focus too much
> on these bulk measurements. When bulk-API cross this objects per slab
> threshold, or is unlucky is it use two per-CPU slab, then the
> measurements can fluctuate a bit.
> 
> Your numbers for SLUB bulk-API:
> 
> SLUB-patched - bulk-API
>  - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
>  - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
> 
> SLUB-original -  bulk-API
>  - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
>  - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
> 
> Maybe it is just noise or instability in measurements, but it seem that the
> 1-object case is consistently slower in your patched version.
> 
> Mail is too long now... I'll take a look at your SLAB results and followup.


Thank you very much for helping with the analysis!

So does it mean you're looking at the smallest number in each series?
If so, the difference is not that big?
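
To make the question concrete, a small sketch in plain userspace C (not
part of the patchset, helper names are made up) that takes the smallest
sample of each series and the relative slowdown, using the SLUB
"fastpath reuse" numbers quoted above. Picking the minimum is just one
possible way to read noisy data, not necessarily the one Jesper used.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest value in a series of cycle counts (n must be > 0). */
unsigned int series_min(const unsigned int *v, size_t n)
{
	unsigned int m;
	size_t i;

	assert(n > 0);
	m = v[0];
	for (i = 1; i < n; i++)
		if (v[i] < m)
			m = v[i];
	return m;
}

/* Relative slowdown of patched vs. original, in percent. */
double slowdown_pct(unsigned int patched, unsigned int original)
{
	assert(original > 0);
	return 100.0 * ((double)patched - (double)original) / (double)original;
}

/* SLUB "fastpath reuse" samples from the quoted mail, in cycles(tsc). */
static const unsigned int slub_patched[]  = { 184, 177, 176 };
static const unsigned int slub_original[] = { 178, 153, 156 };
```

With these inputs series_min() gives 176 vs 153 cycles, and
slowdown_pct(176, 153) is about 15%.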

Theoretically speaking it should get worse (especially for non-root allocations),
but if the difference is not big, it still should be better, because there is
a big expected win from memory savings/smaller working set/less fragmentation etc.

The only thing I'm slightly worried about is the effect on root allocations
if we're sharing slab caches between root- and non-root allocations. Because if
someone depends so much on the allocation speed, memcg-based accounting can be
ignored anyway. For most users the cost of allocation is negligible.
That's why the patch which merges root- and memcg slab caches is put on top
and can be reverted if somebody complains.

Thank you!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-18  9:31               ` Jesper Dangaard Brouer
@ 2020-06-19  1:30                 ` Roman Gushchin
  2020-06-19  8:32                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19  1:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman

On Thu, Jun 18, 2020 at 11:31:21AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 18 Jun 2020 10:43:44 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > On Wed, 17 Jun 2020 18:29:28 -0700
> > Roman Gushchin <guro@fb.com> wrote:
> > 
> > > On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:  
> > > > On 6/17/20 5:32 AM, Roman Gushchin wrote:    
> > > > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:    
> > > > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:    
> > > > >> >
> > > > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:    
> > > > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:    
> > > > >> > > >    
> > > > >> [...]    
> > > > >> > >
> > > > >> > > Have you performed any [perf] testing on SLAB with this patchset?    
> > > > >> >
> > > > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > > > >> > significant difference. I've checked that it compiles, boots and passes
> > > > >> > kselftests. And that memory savings are there.
> > > > >> >    
> > > > >> 
> > > > >> What about performance? Also you mentioned that sharing kmem-cache
> > > > >> between accounted and non-accounted can have additional overhead. Any
> > > > >> difference between SLAB and SLUB for such a case?    
> > > > > 
> > > > > Not really.
> > > > > 
> > > > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > > > allocations, which is something I've tried hard to avoid in my original version.
> > > > > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > > > it's hard to argue with Johanness, who pushed on this design.
> > > > > 
> > > > > With performance testing it's not that easy, because it's not obvious what
> > > > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > > > measuring something like 1000000 allocations and deallocations in a line from
> > > > > a single kmem_cache will show a regression. But in the real world the relative
> > > > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > > > working set and from having shared kmem_cache objects cache hot.
> > > > > Not speaking about some extra memory and the fragmentation reduction.
> > > > > 
> > > > > We've done an extensive testing of the original version in Facebook production,
> > > > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > > > using an original version with two sets of kmem_caches.
> > > > > 
> > > > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > > > can help with the performance evaluation, I'll appreciate it a lot.    
> > > > 
> > > > Jesper provided some pointers here [1], it would be really great if you could
> > > > run at least those microbenchmarks. With mmtests it's the major question of
> > > > which subset/profiles to run, maybe the referenced commits provide some hints,
> > > > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/    
> > > 
> > > Oh, Jesper, I'm really sorry, somehow I missed your mail.
> > > Thank you, Vlastimil, for pointing at it.
> > > 
> > > I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> > > 
> > > I ran original vs patched with SLUB and SLAB, each test several times and picked
> > > 3 which looked most consistently. But it still looks very noisy.
> > > 
> > > I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> > > it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).  
> > 
> > What about running these tests on the server level hardware, that you
> > intent to run this on?  
> 
> To give you an idea of the performance difference I ran the same test
> on a Broadwell Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz.
> 
> The SLUB fastpath:
>  Type:kmem fastpath reuse Per elem: 60 cycles(tsc) 16.822 ns
> 
> 
> > > 
> > > How should I interpret this data?  
> > 
> > First of all these SLUB+SLAB microbenchmarks use object size 256 bytes,
> > because network stack alloc object of this size for SKBs/sk_buff (due
> > to cache-align as used size is 224 bytes). Checked SLUB: Each slab use
> > 2 pages (8192 bytes) and contain 32 object of size 256 (256*32=8192).
> > 
> >   The SLUB allocator have a per-CPU slab which speedup fast-reuse, in this
> > case up-to 32 objects. For SLUB the "fastpath reuse" test this behaviour,
> > and it serves as a baseline for optimal 1-object performance (where my bulk
> > API tries to beat that, which is possible even for 1-object due to knowing
> > bulk API cannot be used from IRQ context).
> > 
> > SLUB fastpath: 3 measurements reporting cycles(tsc)
> >  - SLUB-patched : fastpath reuse: 184 - 177 - 176  cycles(tsc)
> >  - SLUB-original: fastpath reuse: 178 - 153 - 156  cycles(tsc)
> > 
> 
> For your SLAB results:
> 
>  SLAB fastpath: 3 measurements reporting cycles(tsc)
>   - SLAB-patched : 161 - 160 - 163  cycles(tsc)
>   - SLAB-original: 174 - 170 - 159  cycles(tsc)
> 
> I find it strange that SLAB is slightly better than SLUB (in many
> measurements), because SLUB should have an advantage on this fast-path
> quick reuse due to the per-CPU slabs.  Maybe this is also related to
> the CPU arch you are using?
> 
> 
> > There are some stability concerns as you mention, but it seems pretty
> > consistently that patched version is slower. If you compile with
> > no-PREEMPT you can likely get more stable results (and remove a slight
> > overhead for SLUB fastpath).
> > 
> > The microbenchmark also measures the bulk-API, which is AFAIK only used
> > by network stack (and io_uring). I guess you shouldn't focus too much
> > on these bulk measurements. When bulk-API cross this objects per slab
> > threshold, or is unlucky is it use two per-CPU slab, then the
> > measurements can fluctuate a bit.
> > 
> > Your numbers for SLUB bulk-API:
> > 
> > SLUB-patched - bulk-API
> >  - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
> >  - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
> >  - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
> >  - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
> >  - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
> > 
> > SLUB-original -  bulk-API
> >  - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
> >  - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
> >  - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
> >  - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
> >  - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)
> 
> Your numbers for SLAB bulk-API:
> 
> SLAB-patched -  bulk-API
>  - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
>  - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
>  - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
>  - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
>  - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
> 
> SLAB-original-  bulk-API
>  - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
>  - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
>  - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
>  - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
>  - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
> 
> In case of SLAB I expect the bulk-API to be slightly faster than SLUB,
> as the SLUB bulk code is much more advanced.

So again it looks like a patched version is only slightly worse if we're taking
the smallest number in each series. Is it a correct assumption?

Thanks!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting
  2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
  2020-06-18  0:47   ` Shakeel Butt
  2020-06-18 14:55   ` Shakeel Butt
@ 2020-06-19  1:31   ` Shakeel Butt
  2 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-19  1:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> The reference counting of a memcg is currently coupled directly to how
> many 4k pages are charged to it. This doesn't work well with Roman's
> new slab controller, which maintains pools of objects and doesn't want
> to keep an extra balance sheet for the pages backing those objects.
>
> This unusual refcounting design (reference counts usually track
> pointers to an object) is only for historical reasons: memcg used to
> not take any css references and simply stalled offlining until all
> charges had been reparented and the page counters had dropped to
> zero. When we got rid of the reparenting requirement, the simple
> mechanical translation was to take a reference for every charge.
>
> More historical context can be found in commit e8ea14cc6ead ("mm:
> memcontrol: take a css reference for each charged page"),
> commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
> tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
> reclaim from offlined groups").
>
> The new slab controller exposes the limitations in this scheme, so
> let's switch it to a more idiomatic reference counting model based on
> actual kernel pointers to the memcg:
>
> - The per-cpu stock holds a reference to the memcg its caching
>
> - User pages hold a reference for their page->mem_cgroup. Transparent
>   huge pages will no longer acquire tail references in advance, we'll
>   get them if needed during the split.
>
> - Kernel pages hold a reference for their page->mem_cgroup
>
> - Pages allocated in the root cgroup will acquire and release css
>   references for simplicity. css_get() and css_put() optimize that.
>
> - The current memcg_charge_slab() already hacked around the per-charge
>   references; this change gets rid of that as well.
>
> Roman:
> 1) Rebased on top of the current mm tree: added css_get() in
>    mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
> 2) I've reformatted commit references in the commit log to make
>    checkpatch.pl happy.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-19  1:30                 ` Roman Gushchin
@ 2020-06-19  8:32                   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 92+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-19  8:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman, brouer

On Thu, 18 Jun 2020 18:30:13 -0700
Roman Gushchin <guro@fb.com> wrote:

> On Thu, Jun 18, 2020 at 11:31:21AM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 18 Jun 2020 10:43:44 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> > > On Wed, 17 Jun 2020 18:29:28 -0700
> > > Roman Gushchin <guro@fb.com> wrote:
> > >   
> > > > On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:    
> > > > > On 6/17/20 5:32 AM, Roman Gushchin wrote:      
> > > > > > On Tue, Jun 16, 2020 at 08:05:39PM -0700, Shakeel Butt wrote:      
> > > > > >> On Tue, Jun 16, 2020 at 7:41 PM Roman Gushchin <guro@fb.com> wrote:      
> > > > > >> >
> > > > > >> > On Tue, Jun 16, 2020 at 06:46:56PM -0700, Shakeel Butt wrote:      
> > > > > >> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:      
> > > > > >> > > >      
> > > > > >> [...]      
> > > > > >> > >
> > > > > >> > > Have you performed any [perf] testing on SLAB with this patchset?      
> > > > > >> >
> > > > > >> > The accounting part is the same for SLAB and SLUB, so there should be no
> > > > > >> > significant difference. I've checked that it compiles, boots and passes
> > > > > >> > kselftests. And that memory savings are there.
> > > > > >> >      
> > > > > >> 
> > > > > >> What about performance? Also you mentioned that sharing kmem-cache
> > > > > >> between accounted and non-accounted can have additional overhead. Any
> > > > > >> difference between SLAB and SLUB for such a case?      
> > > > > > 
> > > > > > Not really.
> > > > > > 
> > > > > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > > > > allocations, which is something I've tried hard to avoid in my original version.
> > > > > > But I have to admit, it allows to simplify and remove a lot of code, and here
> > > > > > it's hard to argue with Johanness, who pushed on this design.
> > > > > > 
> > > > > > With performance testing it's not that easy, because it's not obvious what
> > > > > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > > > > measuring something like 1000000 allocations and deallocations in a line from
> > > > > > a single kmem_cache will show a regression. But in the real world the relative
> > > > > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > > > > working set and from having shared kmem_cache objects cache hot.
> > > > > > Not speaking about some extra memory and the fragmentation reduction.
> > > > > > 
> > > > > > We've done an extensive testing of the original version in Facebook production,
> > > > > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > > > > using an original version with two sets of kmem_caches.
> > > > > > 
> > > > > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > > > > can help with the performance evaluation, I'll appreciate it a lot.      
> > > > > 
> > > > > Jesper provided some pointers here [1], it would be really great if you could
> > > > > run at least those microbenchmarks. With mmtests it's the major question of
> > > > > which subset/profiles to run, maybe the referenced commits provide some hints,
> > > > > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-mm/20200527103545.4348ac10@carbon/      
> > > > 
> > > > Oh, Jesper, I'm really sorry, somehow I missed your mail.
> > > > Thank you, Vlastimil, for pointing at it.
> > > > 
> > > > I've got some results (slab_bulk_test01), but honestly I fail to interpret them.
> > > > 
> > > > I ran original vs patched with SLUB and SLAB, each test several times and picked
> > > > 3 which looked most consistently. But it still looks very noisy.
> > > > 
> > > > I ran them on my desktop (8-core Ryzen 1700, 16 GB RAM, Fedora 32),
> > > > it's 5.8-rc1 + slab controller v6 vs 5.8-rc1 (default config from Fedora 32).    
> > > 
> > > What about running these tests on the server level hardware, that you
> > > intent to run this on?    
> > 
> > To give you an idea of the performance difference I ran the same test
> > on a Broadwell Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz.
> > 
> > The SLUB fastpath:
> >  Type:kmem fastpath reuse Per elem: 60 cycles(tsc) 16.822 ns
> > 
> >   
> > > > 
> > > > How should I interpret this data?    
> > > 
> > > First of all these SLUB+SLAB microbenchmarks use object size 256 bytes,
> > > because network stack alloc object of this size for SKBs/sk_buff (due
> > > to cache-align as used size is 224 bytes). Checked SLUB: Each slab use
> > > 2 pages (8192 bytes) and contain 32 object of size 256 (256*32=8192).
> > > 
> > >   The SLUB allocator have a per-CPU slab which speedup fast-reuse, in this
> > > case up-to 32 objects. For SLUB the "fastpath reuse" test this behaviour,
> > > and it serves as a baseline for optimal 1-object performance (where my bulk
> > > API tries to beat that, which is possible even for 1-object due to knowing
> > > bulk API cannot be used from IRQ context).
> > > 
> > > SLUB fastpath: 3 measurements reporting cycles(tsc)
> > >  - SLUB-patched : fastpath reuse: 184 - 177 - 176  cycles(tsc)
> > >  - SLUB-original: fastpath reuse: 178 - 153 - 156  cycles(tsc)
> > >   
> > 
> > For your SLAB results:
> > 
> >  SLAB fastpath: 3 measurements reporting cycles(tsc)
> >   - SLAB-patched : 161 - 160 - 163  cycles(tsc)
> >   - SLAB-original: 174 - 170 - 159  cycles(tsc)
> > 
> > I find it strange that SLAB is slightly better than SLUB (in many
> > measurements), because SLUB should have an advantage on this fast-path
> > quick reuse due to the per-CPU slabs.  Maybe this is also related to
> > the CPU arch you are using?
> > 
> >   
> > > There are some stability concerns as you mention, but it seems pretty
> > > consistently that patched version is slower. If you compile with
> > > no-PREEMPT you can likely get more stable results (and remove a slight
> > > overhead for SLUB fastpath).
> > > 
> > > The microbenchmark also measures the bulk-API, which is AFAIK only used
> > > by network stack (and io_uring). I guess you shouldn't focus too much
> > > on these bulk measurements. When bulk-API cross this objects per slab
> > > threshold, or is unlucky is it use two per-CPU slab, then the
> > > measurements can fluctuate a bit.
> > > 
> > > Your numbers for SLUB bulk-API:
> > > 
> > > SLUB-patched - bulk-API
> > >  - SLUB-patched : bulk_quick_reuse objects=1 : 187 -  90 - 224  cycles(tsc)
> > >  - SLUB-patched : bulk_quick_reuse objects=2 : 110 -  53 - 133  cycles(tsc)
> > >  - SLUB-patched : bulk_quick_reuse objects=3 :  88 -  95 -  42  cycles(tsc)
> > >  - SLUB-patched : bulk_quick_reuse objects=4 :  91 -  85 -  36  cycles(tsc)
> > >  - SLUB-patched : bulk_quick_reuse objects=8 :  32 -  66 -  32  cycles(tsc)
> > > 
> > > SLUB-original -  bulk-API
> > >  - SLUB-original: bulk_quick_reuse objects=1 :  87 -  87 - 142  cycles(tsc)
> > >  - SLUB-original: bulk_quick_reuse objects=2 :  52 -  53 -  53  cycles(tsc)
> > >  - SLUB-original: bulk_quick_reuse objects=3 :  42 -  42 -  91  cycles(tsc)
> > >  - SLUB-original: bulk_quick_reuse objects=4 :  91 -  37 -  37  cycles(tsc)
> > >  - SLUB-original: bulk_quick_reuse objects=8 :  31 -  79 -  76  cycles(tsc)  
> > 
> > Your numbers for SLAB bulk-API:
> > 
> > SLAB-patched -  bulk-API
> >  - SLAB-patched : bulk_quick_reuse objects=1 :  67 -  67 - 140  cycles(tsc)
> >  - SLAB-patched : bulk_quick_reuse objects=2 :  55 -  46 -  46  cycles(tsc)
> >  - SLAB-patched : bulk_quick_reuse objects=3 :  93 -  94 -  39  cycles(tsc)
> >  - SLAB-patched : bulk_quick_reuse objects=4 :  35 -  88 -  85  cycles(tsc)
> >  - SLAB-patched : bulk_quick_reuse objects=8 :  30 -  30 -  30  cycles(tsc)
> > 
> > SLAB-original-  bulk-API
> >  - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 -  67  cycles(tsc)
> >  - SLAB-original: bulk_quick_reuse objects=2 :  45 -  46 -  46  cycles(tsc)
> >  - SLAB-original: bulk_quick_reuse objects=3 :  38 -  39 -  39  cycles(tsc)
> >  - SLAB-original: bulk_quick_reuse objects=4 :  35 -  87 -  87  cycles(tsc)
> >  - SLAB-original: bulk_quick_reuse objects=8 :  29 -  66 -  30  cycles(tsc)
> > 
> > In case of SLAB I expect the bulk-API to be slightly faster than SLUB,
> > as the SLUB bulk code is much more advanced.  
> 
> So again it looks like a patched version is only slightly worse if we're taking
> the smallest number in each series. Is it a correct assumption?

Yes, I guess that is a good way to look at these somewhat fluctuating numbers.
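As a rough illustration of the "smallest number in each series" reading,
here is a minimal C sketch that takes the minimum of the three fastpath
runs quoted above and reports the patched-vs-original difference. The
helper names (series_min, min_delta) are made up for this example; only
the sample values come from the thread.

```c
#include <stddef.h>

/* cycles(tsc) per element, three runs each, copied from the thread */
const unsigned int slub_patched[]  = { 184, 177, 176 };
const unsigned int slub_original[] = { 178, 153, 156 };
const unsigned int slab_patched[]  = { 161, 160, 163 };
const unsigned int slab_original[] = { 174, 170, 159 };

/* Hypothetical helper: smallest sample of a non-empty series. */
static unsigned int series_min(const unsigned int *v, size_t n)
{
	unsigned int min = v[0];
	size_t i;

	for (i = 1; i < n; i++)
		if (v[i] < min)
			min = v[i];
	return min;
}

/*
 * Hypothetical helper: min(patched) - min(original).
 * A positive result means the patched kernel is slower.
 */
static int min_delta(const unsigned int *patched,
		     const unsigned int *original, size_t n)
{
	return (int)series_min(patched, n) - (int)series_min(original, n);
}
```

With these inputs the SLUB fastpath difference is 176 - 153 = 23 cycles
and the SLAB fastpath difference is 160 - 159 = 1 cycle.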

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-19  1:27               ` Roman Gushchin
@ 2020-06-19  9:39                 ` Jesper Dangaard Brouer
  2020-06-19 18:47                   ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Jesper Dangaard Brouer @ 2020-06-19  9:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman, brouer, Larry Woodman

On Thu, 18 Jun 2020 18:27:12 -0700
Roman Gushchin <guro@fb.com> wrote:

> Theoretically speaking it should get worse (especially for non-root allocations),
> but if the difference is not big, it still should be better, because there is
> a big expected win from memory savings/smaller working set/less fragmentation etc.
> 
> The only thing I'm slightly worried is what's the effect on root allocations
> if we're sharing slab caches between root- and non-root allocations. Because if
> someone depends so much on the allocation speed, memcg-based accounting can be
> ignored anyway. For most users the cost of allocation is negligible.
> That's why the patch which merges root- and memcg slab caches is put on top
> and can be reverted if somebody will complain.

In general I like this work for saving memory, but you also have to be
aware of the negative consequences of sharing slab caches.  At Red Hat
we have experienced very hard-to-find kernel bugs that pointed to memory
corruption in completely the wrong kernel code, because other kernel code
was corrupting the shared slab cache.  (Hint: a workaround is to enable
SLUB debugging, which disables this sharing.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-08 23:06 ` [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-06-19 15:42   ` Shakeel Butt
  2020-06-19 21:38     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-19 15:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Obj_cgroup API provides an ability to account sub-page sized kernel
> objects, which potentially outlive the original memory cgroup.
>
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
>
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
>
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
>   struct obj_cgroup *get_obj_cgroup_from_current(void);
>
> Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> reference counter. It substitutes a memory cgroup in places where
> it's necessary to charge a custom amount of bytes instead of pages.
>
> All charged memory rounded down to pages is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
>
> It implements reparenting: on memcg offlining it's getting reattached
> to the parent memory cgroup. Each online memory cgroup has an
> associated active object cgroup to handle new allocations and the list
> of all attached object cgroups. On offlining of a cgroup this list is
> reparented and for each object cgroup in the list the memcg pointer is
> swapped to the parent memory cgroup. It prevents long-living objects
> from pinning the original memory cgroup in the memory.
>
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is a part of
> obj_cgroup object. So on cgroup offlining the leftover is automatically
> reparented.
>
> memcg->objcg is rcu protected.
> objcg->memcg is a raw pointer, which is always pointing at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller

What type of caller? The allocator?

> must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h |  51 +++++++
>  mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 338 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 93dbc7f9d8b8..c69e66fe4f12 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>  #include <linux/page-flags.h>
>
>  struct mem_cgroup;
> +struct obj_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
>         struct wb_completion done;      /* tracks in-flight foreign writebacks */
>  };
>
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +       struct percpu_ref refcnt;
> +       struct mem_cgroup *memcg;
> +       atomic_t nr_charged_bytes;

So, we still charge the mem page counter in pages but keep the
remaining sub-page slack charge in nr_charged_bytes, right?

> +       union {
> +               struct list_head list;
> +               struct rcu_head rcu;
> +       };
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -301,6 +318,8 @@ struct mem_cgroup {
>         int kmemcg_id;
>         enum memcg_kmem_state kmem_state;
>         struct list_head kmem_caches;
> +       struct obj_cgroup __rcu *objcg;
> +       struct list_head objcg_list;
>  #endif
>
[snip]
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> +                                 struct mem_cgroup *parent)
> +{
> +       struct obj_cgroup *objcg, *iter;
> +
> +       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> +
> +       spin_lock_irq(&css_set_lock);
> +
> +       /* Move active objcg to the parent's list */
> +       xchg(&objcg->memcg, parent);
> +       css_get(&parent->css);
> +       list_add(&objcg->list, &parent->objcg_list);

So, memcg->objcg_list will always contain only the objcgs of offlined
descendants. I would recommend renaming objcg_list to show that
clearly. Maybe offlined_objcg_list or descendants_objcg_list or
something else.

> +
> +       /* Move already reparented objcgs to the parent's list */
> +       list_for_each_entry(iter, &memcg->objcg_list, list) {
> +               css_get(&parent->css);
> +               xchg(&iter->memcg, parent);
> +               css_put(&memcg->css);
> +       }
> +       list_splice(&memcg->objcg_list, &parent->objcg_list);
> +
> +       spin_unlock_irq(&css_set_lock);
> +
> +       percpu_ref_kill(&objcg->refcnt);
> +}
> +
>  /*
[snip]
>
> +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> +{
> +       struct obj_cgroup *objcg = NULL;
> +       struct mem_cgroup *memcg;
> +
> +       if (unlikely(!current->mm))
> +               return NULL;

I have not seen the users of this function yet but shouldn't the above
check be (!current->mm && !current->active_memcg)?

Do we need a mem_cgroup_disabled() check as well?

> +
> +       rcu_read_lock();
> +       if (unlikely(current->active_memcg))
> +               memcg = rcu_dereference(current->active_memcg);
> +       else
> +               memcg = mem_cgroup_from_task(current);
> +
> +       for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +               objcg = rcu_dereference(memcg->objcg);
> +               if (objcg && obj_cgroup_tryget(objcg))
> +                       break;
> +       }
> +       rcu_read_unlock();
> +
> +       return objcg;
> +}
> +
[...]
> +
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +       struct memcg_stock_pcp *stock;
> +       unsigned long flags;
> +
> +       local_irq_save(flags);
> +
> +       stock = this_cpu_ptr(&memcg_stock);
> +       if (stock->cached_objcg != objcg) { /* reset if necessary */
> +               drain_obj_stock(stock);
> +               obj_cgroup_get(objcg);
> +               stock->cached_objcg = objcg;
> +               stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> +       }
> +       stock->nr_bytes += nr_bytes;
> +
> +       if (stock->nr_bytes > PAGE_SIZE)
> +               drain_obj_stock(stock);

The normal stock can go up to 32*nr_cpus*PAGE_SIZE. I am wondering if
just PAGE_SIZE is too small for the obj stock.

> +
> +       local_irq_restore(flags);
> +}
> +
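One way to picture the byte-sized stock and the sub-page leftover
discussed above is a small user-space model. This is only a sketch under
assumptions: struct objcg_model and the model_* helpers are invented for
illustration, there is a single objcg and a single CPU, and the charge
path (round the request up to whole pages, put the unused tail into the
stock) is assumed rather than taken from the quoted hunks. Only the
refill-then-drain-above-PAGE_SIZE step follows the refill_obj_stock()
quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Toy stand-in for one objcg plus one CPU's stock; not kernel structures. */
struct objcg_model {
	unsigned long pages_charged;	/* what the page counter would see */
	unsigned int stock_bytes;	/* pre-charged bytes in the stock */
	unsigned int leftover_bytes;	/* models nr_charged_bytes */
};

/* Give whole pages back; park the sub-page remainder in the objcg. */
static void model_drain(struct objcg_model *m)
{
	m->pages_charged -= m->stock_bytes / MODEL_PAGE_SIZE;
	m->leftover_bytes += m->stock_bytes % MODEL_PAGE_SIZE;
	m->stock_bytes = 0;
}

/*
 * With a single objcg, a non-zero leftover can only exist right after
 * a drain, so picking it up on every refill matches the "reset" case.
 */
static void model_refill(struct objcg_model *m, unsigned int nr_bytes)
{
	m->stock_bytes += m->leftover_bytes;
	m->leftover_bytes = 0;
	m->stock_bytes += nr_bytes;
	if (m->stock_bytes > MODEL_PAGE_SIZE)
		model_drain(m);
}

/* Assumed charge path: use the stock if it covers the request. */
static void model_charge(struct objcg_model *m, size_t size)
{
	unsigned long nr_pages = size / MODEL_PAGE_SIZE;
	unsigned int nr_bytes = size % MODEL_PAGE_SIZE;

	assert(size > 0);
	if (m->stock_bytes >= size) {
		m->stock_bytes -= (unsigned int)size;
		return;
	}
	if (nr_bytes)
		nr_pages++;		/* round up to whole pages */
	m->pages_charged += nr_pages;
	if (nr_bytes)
		model_refill(m, MODEL_PAGE_SIZE - nr_bytes);
}

/* Uncharging an object simply returns its bytes to the stock. */
static void model_uncharge(struct objcg_model *m, size_t size)
{
	model_refill(m, (unsigned int)size);
}
```

In this model the page counter only ever moves in whole pages, while
pages_charged * MODEL_PAGE_SIZE always equals the live object bytes plus
stock_bytes plus leftover_bytes.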

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-06-08 23:06 ` [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-06-19 16:36   ` Shakeel Butt
  2020-06-20  0:25     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-19 16:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
>
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
>
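The low-bit tagging described in the quoted commit message can be
sketched in user space as follows. The helper names are hypothetical
and are not the kernel functions; the sketch assumes the pointer vector
is at least 2-byte aligned, so bit 0 of its address is always free to
use as a marker.

```c
#include <stdint.h>

struct obj_cgroup;	/* opaque here; only pointers are handled */

/* Hypothetical helper: store a vector address with the low bit set. */
static inline uintptr_t tag_objcg_vec(struct obj_cgroup **vec)
{
	return (uintptr_t)vec | (uintptr_t)0x1;
}

/* Hypothetical helper: does this word hold a tagged vector address? */
static inline int word_is_objcg_vec(uintptr_t word)
{
	return (word & (uintptr_t)0x1) != 0;
}

/* Hypothetical helper: strip the tag to recover the real address. */
static inline struct obj_cgroup **untag_objcg_vec(uintptr_t word)
{
	return (struct obj_cgroup **)(word & ~(uintptr_t)0x1);
}
```

A reader that does not know whether the shared field holds a memcg
pointer or a vector can test the low bit first and only dereference
after stripping it.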

I think the commit message should talk about the potential overhead
(i.e. an extra pointer for each object) along with the justifications
(i.e. less internal fragmentation and potentially more savings than
the overhead).

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mm_types.h |  5 +++-
>  include/linux/slab_def.h |  6 +++++
>  include/linux/slub_def.h |  5 ++++
>  mm/memcontrol.c          | 17 +++++++++++---
>  mm/slab.h                | 49 ++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 64ede5f150dc..0277fbab7c93 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -198,7 +198,10 @@ struct page {
>         atomic_t _refcount;
>
>  #ifdef CONFIG_MEMCG
> -       struct mem_cgroup *mem_cgroup;
> +       union {
> +               struct mem_cgroup *mem_cgroup;
> +               struct obj_cgroup **obj_cgroups;
> +       };
>  #endif
>
>         /*
> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> index abc7de77b988..ccda7b9669a5 100644
> --- a/include/linux/slab_def.h
> +++ b/include/linux/slab_def.h
> @@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
>         return reciprocal_divide(offset, cache->reciprocal_buffer_size);
>  }
>
> +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> +                                    const struct page *page)
> +{
> +       return cache->num;
> +}
> +
>  #endif /* _LINUX_SLAB_DEF_H */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 30e91c83d401..f87302dcfe8c 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
>         return __obj_to_index(cache, page_address(page), obj);
>  }
>
> +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> +                                    const struct page *page)
> +{
> +       return page->objects;
> +}
>  #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7ff66275966c..2020c7542aa1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
>         unsigned long ino = 0;
>
>         rcu_read_lock();
> -       if (PageSlab(page) && !PageTail(page))
> +       if (PageSlab(page) && !PageTail(page)) {
>                 memcg = memcg_from_slab_page(page);
> -       else
> -               memcg = READ_ONCE(page->mem_cgroup);
> +       } else {
> +               memcg = page->mem_cgroup;
> +
> +               /*
> +                * The lowest bit set means that memcg isn't a valid
> +                * memcg pointer, but a obj_cgroups pointer.
> +                * In this case the page is shared and doesn't belong
> +                * to any specific memory cgroup.
> +                */
> +               if ((unsigned long) memcg & 0x1UL)
> +                       memcg = NULL;
> +       }
> +
>         while (memcg && !(memcg->css.flags & CSS_ONLINE))
>                 memcg = parent_mem_cgroup(memcg);
>         if (memcg)
> diff --git a/mm/slab.h b/mm/slab.h
> index 8a574d9361c1..a1633ea15fbf 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>         return s->memcg_params.root_cache;
>  }
>
> +static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
> +{
> +       /*
> +        * page->mem_cgroup and page->obj_cgroups are sharing the same
> +        * space. To distinguish between them in case we don't know for sure
> +        * that the page is a slab page (e.g. page_cgroup_ino()), let's
> +        * always set the lowest bit of obj_cgroups.
> +        */
> +       return (struct obj_cgroup **)
> +               ((unsigned long)page->obj_cgroups & ~0x1UL);
> +}
> +
>  /*
>   * Expects a pointer to a slab page. Please note, that PageSlab() check
>   * isn't sufficient, as it returns true also for tail compound slab pages,
> @@ -406,6 +418,26 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>         percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
>  }
>
> +static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> +                                              struct kmem_cache *s, gfp_t gfp)
> +{
> +       unsigned int objects = objs_per_slab_page(s, page);
> +       void *vec;
> +
> +       vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);

Should the above allocation be on the same node as the page?

> +       if (!vec)
> +               return -ENOMEM;
> +
> +       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> +       return 0;
> +}
> +
> +static inline void memcg_free_page_obj_cgroups(struct page *page)
> +{
> +       kfree(page_obj_cgroups(page));
> +       page->obj_cgroups = NULL;
> +}
> +
>  extern void slab_init_memcg_params(struct kmem_cache *);
>  extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
>
> @@ -455,6 +487,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
>  {
>  }
>
> +static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> +                                              struct kmem_cache *s, gfp_t gfp)
> +{
> +       return 0;
> +}
> +
> +static inline void memcg_free_page_obj_cgroups(struct page *page)
> +{
> +}
> +
>  static inline void slab_init_memcg_params(struct kmem_cache *s)
>  {
>  }
> @@ -481,12 +523,18 @@ static __always_inline int charge_slab_page(struct page *page,
>                                             gfp_t gfp, int order,
>                                             struct kmem_cache *s)
>  {
> +       int ret;
> +
>         if (is_root_cache(s)) {
>                 mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
>                                     PAGE_SIZE << order);
>                 return 0;
>         }
>
> +       ret = memcg_alloc_page_obj_cgroups(page, s, gfp);
> +       if (ret)
> +               return ret;
> +
>         return memcg_charge_slab(page, gfp, order, s);
>  }
>
> @@ -499,6 +547,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
>                 return;
>         }
>
> +       memcg_free_page_obj_cgroups(page);
>         memcg_uncharge_slab(page, order, s);
>  }
>
> --
> 2.25.4
>
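
For readers following the tagging scheme in page_obj_cgroups() and
memcg_alloc_page_obj_cgroups() above, here is a minimal userspace sketch of
the idea. It is not kernel code: struct fake_page and the fake_*() helpers
are hypothetical names, calloc() stands in for kcalloc(), and the sketch
only assumes that the allocator returns pointers aligned to at least two
bytes, so bit 0 is free to act as a tag.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct obj_cgroup;              /* opaque here; only pointers are stored */
struct mem_cgroup;

/* Hypothetical stand-in for the union added to struct page. */
struct fake_page {
	union {
		struct mem_cgroup *mem_cgroup;
		struct obj_cgroup **obj_cgroups;
	};
};

/* Allocate a zeroed vector and store it with the lowest bit set. */
static int fake_alloc_obj_cgroups(struct fake_page *page, unsigned int objects)
{
	void *vec = calloc(objects, sizeof(struct obj_cgroup *));

	if (!vec)
		return -1;
	page->obj_cgroups = (struct obj_cgroup **)((uintptr_t)vec | 0x1UL);
	return 0;
}

/* Strip the tag bit to get the usable vector pointer back. */
static struct obj_cgroup **fake_page_obj_cgroups(const struct fake_page *page)
{
	return (struct obj_cgroup **)
		((uintptr_t)page->obj_cgroups & ~(uintptr_t)0x1);
}

/* Mirrors the check in page_cgroup_ino(): a tagged value is not a memcg. */
static int fake_page_has_memcg(const struct fake_page *page)
{
	return page->mem_cgroup && !((uintptr_t)page->mem_cgroup & 0x1UL);
}

/* Free through the untagged pointer, then clear the shared field. */
static void fake_free_obj_cgroups(struct fake_page *page)
{
	free(fake_page_obj_cgroups(page));
	page->obj_cgroups = NULL;
}
```

A reader that cannot tell whether the page is a slab page looks at the
shared field through fake_page_has_memcg(); slab code that knows what it is
dealing with goes through fake_page_obj_cgroups() and never sees the tag.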

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-19  9:39                 ` Jesper Dangaard Brouer
@ 2020-06-19 18:47                   ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19 18:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Mel Gorman, Larry Woodman

On Fri, Jun 19, 2020 at 11:39:45AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 18 Jun 2020 18:27:12 -0700
> Roman Gushchin <guro@fb.com> wrote:
> 
> > Theoretically speaking it should get worse (especially for non-root allocations),
> > but if the difference is not big, it still should be better, because there is
> > a big expected win from memory savings/smaller working set/less fragmentation etc.
> > 
> > The only thing I'm slightly worried is what's the effect on root allocations
> > if we're sharing slab caches between root- and non-root allocations. Because if
> > someone depends so much on the allocation speed, memcg-based accounting can be
> > ignored anyway. For most users the cost of allocation is negligible.
> > That's why the patch which merges root- and memcg slab caches is put on top
> > and can be reverted if somebody will complain.
> 
> In general I like this work for saving memory, but you also have to be
> aware of the negative consequences of sharing slab caches.  At Red Hat
> we have experienced very hard to find kernel bugs, that point to memory
> corruption at a completely wrong kernel code, because other kernel code
> were corrupting the shared slab cache.  (Hint a workaround is to enable
> SLUB debugging to disable this sharing).

I agree, but it must be related to the sharing of slab pages between different
types of objects. We've also disabled cache sharing many times in order
to compare slab usage between different major kernel versions or to debug
memory corruptions.

As for sharing between multiple cgroups, it just brings the
CONFIG_MEMCG_KMEM memory layout back to the !CONFIG_MEMCG_KMEM one.
I doubt that anyone ever considered the kernel memory accounting
as a debugging mechanism. Quite the opposite, we've encountered a lot of
tricky issues related to the dynamic creation and destruction of kmem_caches
and their lifetime. Removing this code should make things simpler and
hopefully more reliable.

Thanks!



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-19 15:42   ` Shakeel Butt
@ 2020-06-19 21:38     ` Roman Gushchin
  2020-06-19 22:16       ` Shakeel Butt
  2020-06-20 22:50       ` Andrew Morton
  0 siblings, 2 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19 21:38 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 08:42:34AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Obj_cgroup API provides an ability to account sub-page sized kernel
> > objects, which potentially outlive the original memory cgroup.
> >
> > The top-level API consists of the following functions:
> >   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
> >   void obj_cgroup_get(struct obj_cgroup *objcg);
> >   void obj_cgroup_put(struct obj_cgroup *objcg);
> >
> >   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
> >   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> >
> >   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> >   struct obj_cgroup *get_obj_cgroup_from_current(void);
> >
> > Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> > reference counter. It substitutes a memory cgroup in places where
> > it's necessary to charge a custom amount of bytes instead of pages.
> >
> > All charged memory rounded down to pages is charged to the
> > corresponding memory cgroup using __memcg_kmem_charge().
> >
> > It implements reparenting: on memcg offlining it's getting reattached
> > to the parent memory cgroup. Each online memory cgroup has an
> > associated active object cgroup to handle new allocations and the list
> > of all attached object cgroups. On offlining of a cgroup this list is
> > reparented and for each object cgroup in the list the memcg pointer is
> > swapped to the parent memory cgroup. It prevents long-living objects
> > from pinning the original memory cgroup in the memory.
> >
> > The implementation is based on byte-sized per-cpu stocks. A sub-page
> > sized leftover is stored in an atomic field, which is a part of
> > obj_cgroup object. So on cgroup offlining the leftover is automatically
> > reparented.
> >
> > memcg->objcg is rcu protected.
> > objcg->memcg is a raw pointer, which is always pointing at a memory
> > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > the caller
> 
> What type of caller? The allocator?

Basically whoever uses the pointer. Is it better to s/caller/user?

> 
> > must ensure the lifetime of the cgroup, e.g. grab
> > rcu_read_lock or css_set_lock.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  include/linux/memcontrol.h |  51 +++++++
> >  mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 338 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 93dbc7f9d8b8..c69e66fe4f12 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -23,6 +23,7 @@
> >  #include <linux/page-flags.h>
> >
> >  struct mem_cgroup;
> > +struct obj_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  struct kmem_cache;
> > @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
> >         struct wb_completion done;      /* tracks in-flight foreign writebacks */
> >  };
> >
> > +/*
> > + * Bucket for arbitrarily byte-sized objects charged to a memory
> > + * cgroup. The bucket can be reparented in one piece when the cgroup
> > + * is destroyed, without having to round up the individual references
> > + * of all live memory objects in the wild.
> > + */
> > +struct obj_cgroup {
> > +       struct percpu_ref refcnt;
> > +       struct mem_cgroup *memcg;
> > +       atomic_t nr_charged_bytes;
> 
> So, we still charge the mem page counter in pages but keep the
> remaining sub-page slack charge in nr_charge_bytes, right?

Kind of. The remainder is usually kept in a per-cpu stock,
but if the stock has to be flushed, it gets flushed to nr_charged_bytes.

> 
> > +       union {
> > +               struct list_head list;
> > +               struct rcu_head rcu;
> > +       };
> > +};
> > +
> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -301,6 +318,8 @@ struct mem_cgroup {
> >         int kmemcg_id;
> >         enum memcg_kmem_state kmem_state;
> >         struct list_head kmem_caches;
> > +       struct obj_cgroup __rcu *objcg;
> > +       struct list_head objcg_list;
> >  #endif
> >
> [snip]
> > +
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > +                                 struct mem_cgroup *parent)
> > +{
> > +       struct obj_cgroup *objcg, *iter;
> > +
> > +       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> > +
> > +       spin_lock_irq(&css_set_lock);
> > +
> > +       /* Move active objcg to the parent's list */
> > +       xchg(&objcg->memcg, parent);
> > +       css_get(&parent->css);
> > +       list_add(&objcg->list, &parent->objcg_list);
> 
> So, memcg->objcs_list will always only contain the offlined
> descendants objcgs. I would recommend to rename objcg_list to clearly
> show that. Maybe offlined_objcg_list or descendants_objcg_list or
> something else.

Right. Let me add a comment for now and think of a better name.

> 
> > +
> > +       /* Move already reparented objcgs to the parent's list */
> > +       list_for_each_entry(iter, &memcg->objcg_list, list) {
> > +               css_get(&parent->css);
> > +               xchg(&iter->memcg, parent);
> > +               css_put(&memcg->css);
> > +       }
> > +       list_splice(&memcg->objcg_list, &parent->objcg_list);
> > +
> > +       spin_unlock_irq(&css_set_lock);
> > +
> > +       percpu_ref_kill(&objcg->refcnt);
> > +}
> > +
> >  /*
> [snip]
> >
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (unlikely(!current->mm))
> > +               return NULL;
> 
> I have not seen the users of this function yet but shouldn't the above
> check be (!current->mm && !current->active_memcg)?

Yes, good catch, it might save a couple of cycles if
current->mm == current->active_memcg == NULL. Adding.

> 
> Do we need a mem_cgroup_disabled() check as well?

Right now both call sites are guarded by memcg_kmem_enabled(),
so we don't need it.

But maybe it's a good target for some refactorings,
e.g. moving !current->mm and !current->active_memcg checks out
of memcg_kmem_bypass(). And _maybe_ it's better to move memcg_kmem_enabled()
here, but I'm not sure.

> 
> > +
> > +       rcu_read_lock();
> > +       if (unlikely(current->active_memcg))
> > +               memcg = rcu_dereference(current->active_memcg);
> > +       else
> > +               memcg = mem_cgroup_from_task(current);
> > +
> > +       for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +               objcg = rcu_dereference(memcg->objcg);
> > +               if (objcg && obj_cgroup_tryget(objcg))
> > +                       break;
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       return objcg;
> > +}
> > +
> [...]
> > +
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > +{
> > +       struct memcg_stock_pcp *stock;
> > +       unsigned long flags;
> > +
> > +       local_irq_save(flags);
> > +
> > +       stock = this_cpu_ptr(&memcg_stock);
> > +       if (stock->cached_objcg != objcg) { /* reset if necessary */
> > +               drain_obj_stock(stock);
> > +               obj_cgroup_get(objcg);
> > +               stock->cached_objcg = objcg;
> > +               stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> > +       }
> > +       stock->nr_bytes += nr_bytes;
> > +
> > +       if (stock->nr_bytes > PAGE_SIZE)
> > +               drain_obj_stock(stock);
> 
> The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> just PAGE_SIZE is too less for obj stock.

It works on top of the current stock of 32 pages, so it can grab these
32 pages without any atomic operations. And it should be easy to increase
this limit if we see any benefits.
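
To make the byte-level bookkeeping discussed above easier to follow, here is
a simplified, single-threaded userspace model of it. Everything in it is an
assumption made for illustration: the model_*() names, the 4096-byte page
size, and the fact that one struct model_stock plays the role of one CPU's
stock. It leaves out the underlying 32-page memcg stock, reference counting,
atomics and interrupt disabling. The "> PAGE_SIZE" threshold follows the
quoted refill_obj_stock().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for the model */

struct model_objcg {
	unsigned long memcg_pages;      /* pages charged to the owning memcg */
	unsigned int nr_charged_bytes;  /* sub-page leftover parked here */
};

struct model_stock {                    /* stands in for one CPU's stock */
	struct model_objcg *cached_objcg;
	unsigned int nr_bytes;
};

/* Return whole pages to the memcg, park the sub-page rest in the objcg. */
static void model_drain(struct model_stock *stock)
{
	struct model_objcg *old = stock->cached_objcg;

	if (!old)
		return;
	old->memcg_pages -= stock->nr_bytes / MODEL_PAGE_SIZE;
	old->nr_charged_bytes += stock->nr_bytes % MODEL_PAGE_SIZE;
	stock->nr_bytes = 0;
	stock->cached_objcg = NULL;
}

/* Add bytes to the stock, switching the cached objcg if needed. */
static void model_refill(struct model_stock *stock, struct model_objcg *objcg,
			 unsigned int nr_bytes)
{
	if (stock->cached_objcg != objcg) {
		model_drain(stock);
		stock->cached_objcg = objcg;
		stock->nr_bytes = objcg->nr_charged_bytes;
		objcg->nr_charged_bytes = 0;
	}
	stock->nr_bytes += nr_bytes;
	if (stock->nr_bytes > MODEL_PAGE_SIZE)
		model_drain(stock);
}

/* Charge size bytes: use the stock if possible, else charge whole pages. */
static void model_charge(struct model_stock *stock, struct model_objcg *objcg,
			 size_t size)
{
	unsigned int nr_pages, nr_bytes;

	if (stock->cached_objcg == objcg && stock->nr_bytes >= size) {
		stock->nr_bytes -= (unsigned int)size;
		return;
	}
	nr_pages = (unsigned int)(size / MODEL_PAGE_SIZE);
	nr_bytes = (unsigned int)(size % MODEL_PAGE_SIZE);
	if (nr_bytes)
		nr_pages++;             /* round the charge up to pages */
	objcg->memcg_pages += nr_pages;
	if (nr_bytes)                   /* keep the unused part for later */
		model_refill(stock, objcg, MODEL_PAGE_SIZE - nr_bytes);
}

/* Uncharging simply hands the bytes back to the stock. */
static void model_uncharge(struct model_stock *stock,
			   struct model_objcg *objcg, size_t size)
{
	model_refill(stock, objcg, (unsigned int)size);
}
```

In this model the pages charged to the memcg always equal the live bytes
plus the stock plus the parked leftover, which is the property that lets the
leftover follow the objcg when it is reparented.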

Thank you for looking into the patchset!

Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
Thanks!

--

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ed3af71a6fb..2499f78cf32d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,7 +326,7 @@ struct mem_cgroup {
        int kmemcg_id;
        enum memcg_kmem_state kmem_state;
        struct obj_cgroup __rcu *objcg;
-       struct list_head objcg_list;
+       struct list_head objcg_list; /* list of inherited objcgs */
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70cd44b28db1..9f14b91700d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2843,7 +2843,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
        struct obj_cgroup *objcg = NULL;
        struct mem_cgroup *memcg;
 
-       if (unlikely(!current->mm))
+       if (unlikely(!current->mm && !current->active_memcg))
                return NULL;
 
        rcu_read_lock();
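
The reparenting described in the quoted commit message can be pictured with
a small userspace model. This is a sketch under stated assumptions, not the
kernel implementation: the model_*() names are hypothetical, a singly linked
list replaces list_head, and RCU, css_set_lock and all reference counting
are left out. It also assumes the root cgroup never has an active objcg and
is never offlined.

```c
#include <assert.h>
#include <stddef.h>

struct model_memcg;

struct model_objcg {
	struct model_memcg *memcg;      /* swapped to the parent on offlining */
	struct model_objcg *next;       /* link in the owner's inherited list */
};

struct model_memcg {
	struct model_memcg *parent;
	struct model_objcg *objcg;      /* active objcg, NULL once offlined */
	struct model_objcg *objcg_list; /* objcgs inherited from descendants */
};

/* Re-point the active and all inherited objcgs of memcg at its parent. */
static void model_reparent_objcgs(struct model_memcg *memcg)
{
	struct model_memcg *parent = memcg->parent;
	struct model_objcg *active = memcg->objcg;
	struct model_objcg *iter, *tail = NULL;

	memcg->objcg = NULL;

	/* Already inherited objcgs follow memcg to the parent. */
	for (iter = memcg->objcg_list; iter; iter = iter->next) {
		iter->memcg = parent;
		tail = iter;
	}
	if (tail) {
		tail->next = parent->objcg_list;
		parent->objcg_list = memcg->objcg_list;
		memcg->objcg_list = NULL;
	}

	/* The formerly active objcg joins the parent's list as well. */
	if (active) {
		active->memcg = parent;
		active->next = parent->objcg_list;
		parent->objcg_list = active;
	}
}

/* Walk up from memcg until a cgroup with an active objcg is found. */
static struct model_objcg *model_get_objcg(struct model_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg->objcg)
			return memcg->objcg;
	return NULL;
}
```

After a cgroup goes offline, objects that still hold its objcg resolve to
the parent through objcg->memcg, while new allocations made on behalf of
that cgroup pick up the nearest ancestor's active objcg.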

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-19 21:38     ` Roman Gushchin
@ 2020-06-19 22:16       ` Shakeel Butt
  2020-06-19 22:52         ` Roman Gushchin
  2020-06-20 22:50       ` Andrew Morton
  1 sibling, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-19 22:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 2:38 PM Roman Gushchin <guro@fb.com> wrote:
>
[snip]
> > > memcg->objcg is rcu protected.
> > > objcg->memcg is a raw pointer, which is always pointing at a memory
> > > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > > the caller
> >
> > What type of caller? The allocator?
>
> Basically whoever uses the pointer. Is it better to s/caller/user?
>

Yes 'user' feels better.

> >
[...]
> >
> > The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> > just PAGE_SIZE is too less for obj stock.
>
> It works on top of the current stock of 32 pages, so it can grab these
> 32 pages without any atomic operations. And it should be easy to increase
> this limit if we'll see any benefits.
>
> Thank you for looking into the patchset!
>
> Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
> Thanks!
>
> --

For the following squashed into the original patch:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7ed3af71a6fb..2499f78cf32d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -326,7 +326,7 @@ struct mem_cgroup {
>         int kmemcg_id;
>         enum memcg_kmem_state kmem_state;
>         struct obj_cgroup __rcu *objcg;
> -       struct list_head objcg_list;
> +       struct list_head objcg_list; /* list of inherited objcgs */
>  #endif
>
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 70cd44b28db1..9f14b91700d9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2843,7 +2843,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg;
>
> -       if (unlikely(!current->mm))
> +       if (unlikely(!current->mm && !current->active_memcg))
>                 return NULL;
>
>         rcu_read_lock();

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-19 22:16       ` Shakeel Butt
@ 2020-06-19 22:52         ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-19 22:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 03:16:44PM -0700, Shakeel Butt wrote:
> On Fri, Jun 19, 2020 at 2:38 PM Roman Gushchin <guro@fb.com> wrote:
> >
> [snip]
> > > > memcg->objcg is rcu protected.
> > > > objcg->memcg is a raw pointer, which is always pointing at a memory
> > > > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > > > the caller
> > >
> > > What type of caller? The allocator?
> >
> > Basically whoever uses the pointer. Is it better to s/caller/user?
> >
> 
> Yes 'user' feels better.
> 
> > >
> [...]
> > >
> > > The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> > > just PAGE_SIZE is too less for obj stock.
> >
> > It works on top of the current stock of 32 pages, so it can grab these
> > 32 pages without any atomic operations. And it should be easy to increase
> > this limit if we'll see any benefits.
> >
> > Thank you for looking into the patchset!
> >
> > Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
> > Thanks!
> >
> > --
> 
> For the following squashed into the original patch:
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thank you!

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-06-08 23:06 ` [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-06-20  0:16   ` Shakeel Butt
  2020-06-20  1:19     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-20  0:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Store the obj_cgroup pointer in the corresponding place of
> page->obj_cgroups for each allocated non-root slab object.
> Make sure that each allocated object holds a reference to obj_cgroup.
>
> Objcg pointer is obtained from the memcg->objcg dereferencing
> in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> post_alloc_hook. Then in case of successful allocation(s) it's
> getting stored in the page->obj_cgroups vector.
>
> The objcg obtaining part look a bit bulky now, but it will be simplified
> by next commits in the series.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

One nit below otherwise:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
[snip]
> +static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> +                                             struct obj_cgroup *objcg,
> +                                             size_t size, void **p)
> +{
> +       struct page *page;
> +       unsigned long off;
> +       size_t i;
> +
> +       for (i = 0; i < size; i++) {
> +               if (likely(p[i])) {
> +                       page = virt_to_head_page(p[i]);
> +                       off = obj_to_index(s, page, p[i]);
> +                       obj_cgroup_get(objcg);
> +                       page_obj_cgroups(page)[off] = objcg;
> +               }
> +       }
> +       obj_cgroup_put(objcg);

Nit: we get the objcg reference in memcg_kmem_get_cache(); wouldn't it
look cleaner to put that reference in memcg_kmem_put_cache() instead
of here?

> +       memcg_kmem_put_cache(s);
> +}
> +
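
The reference flow in the hook above can be illustrated with a userspace
model. This is a sketch with assumed names and simplifications: struct
model_objcg uses a plain counter instead of a percpu_ref, all objects are
assumed to live in one page of eight slots, slot indices are passed in
directly instead of being derived with obj_to_index(), and
model_free_hook() is a guess at the counterpart on the free path, which is
not part of the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_OBJS_PER_PAGE 8   /* assumed objects per slab page */

struct model_objcg {
	long refcnt;            /* plain counter instead of a percpu_ref */
};

struct model_slab_page {
	struct model_objcg *obj_cgroups[MODEL_OBJS_PER_PAGE];
};

/*
 * idx[i] is the slot of the i-th object in the page, or -1 if that
 * allocation failed (the p[i] == NULL case in the real hook).
 * The caller enters holding one reference taken during lookup.
 */
static void model_post_alloc_hook(struct model_slab_page *page,
				  struct model_objcg *objcg,
				  size_t size, const int *idx)
{
	size_t i;

	for (i = 0; i < size; i++) {
		if (idx[i] >= 0) {
			objcg->refcnt++;        /* one reference per object */
			page->obj_cgroups[idx[i]] = objcg;
		}
	}
	objcg->refcnt--;                        /* drop the lookup reference */
}

/* On free, the object's slot is cleared and its reference is dropped. */
static void model_free_hook(struct model_slab_page *page, int idx)
{
	struct model_objcg *objcg = page->obj_cgroups[idx];

	if (!objcg)
		return;
	page->obj_cgroups[idx] = NULL;
	objcg->refcnt--;
}
```

The point of the model is the balance: the lookup reference is dropped
exactly once regardless of how many allocations succeeded, and each live
object keeps its own reference until it is freed.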

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-06-19 16:36   ` Shakeel Butt
@ 2020-06-20  0:25     ` Roman Gushchin
  2020-06-20  0:31       ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-20  0:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 09:36:16AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Allocate and release memory to store obj_cgroup pointers for each
> > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > to the allocated space.
> >
> > To distinguish between obj_cgroups and memcg pointers in case
> > when it's not obvious which one is used (as in page_cgroup_ino()),
> > let's always set the lowest bit in the obj_cgroup case.
> >
> 
> I think the commit message should talk about the potential overhead
> (i.e an extra pointer for each object) along with the justifications
> (i.e. less internal fragmentation and potentially more savings than
> the overhead).

How about adding the following chunk? I don't like forward links in
commit messages, so maybe putting it into the cover letter?

This commit temporarily increases the memory footprint of the kernel memory
accounting. To store obj_cgroup pointers we'll need space for one pointer
per allocated object. However, the following patches
in the series will enable sharing of slab pages between memory cgroups,
which will dramatically increase the total slab utilization. And the final
memory footprint will be significantly smaller than before.

> 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/mm_types.h |  5 +++-
> >  include/linux/slab_def.h |  6 +++++
> >  include/linux/slub_def.h |  5 ++++
> >  mm/memcontrol.c          | 17 +++++++++++---
> >  mm/slab.h                | 49 ++++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 78 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 64ede5f150dc..0277fbab7c93 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -198,7 +198,10 @@ struct page {
> >         atomic_t _refcount;
> >
> >  #ifdef CONFIG_MEMCG
> > -       struct mem_cgroup *mem_cgroup;
> > +       union {
> > +               struct mem_cgroup *mem_cgroup;
> > +               struct obj_cgroup **obj_cgroups;
> > +       };
> >  #endif
> >
> >         /*
> > diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> > index abc7de77b988..ccda7b9669a5 100644
> > --- a/include/linux/slab_def.h
> > +++ b/include/linux/slab_def.h
> > @@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> >         return reciprocal_divide(offset, cache->reciprocal_buffer_size);
> >  }
> >
> > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > +                                    const struct page *page)
> > +{
> > +       return cache->num;
> > +}
> > +
> >  #endif /* _LINUX_SLAB_DEF_H */
> > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > index 30e91c83d401..f87302dcfe8c 100644
> > --- a/include/linux/slub_def.h
> > +++ b/include/linux/slub_def.h
> > @@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> >         return __obj_to_index(cache, page_address(page), obj);
> >  }
> >
> > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > +                                    const struct page *page)
> > +{
> > +       return page->objects;
> > +}
> >  #endif /* _LINUX_SLUB_DEF_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7ff66275966c..2020c7542aa1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
> >         unsigned long ino = 0;
> >
> >         rcu_read_lock();
> > -       if (PageSlab(page) && !PageTail(page))
> > +       if (PageSlab(page) && !PageTail(page)) {
> >                 memcg = memcg_from_slab_page(page);
> > -       else
> > -               memcg = READ_ONCE(page->mem_cgroup);
> > +       } else {
> > +               memcg = page->mem_cgroup;
> > +
> > +               /*
> > +                * The lowest bit set means that memcg isn't a valid
> > +                * memcg pointer, but a obj_cgroups pointer.
> > +                * In this case the page is shared and doesn't belong
> > +                * to any specific memory cgroup.
> > +                */
> > +               if ((unsigned long) memcg & 0x1UL)
> > +                       memcg = NULL;
> > +       }
> > +
> >         while (memcg && !(memcg->css.flags & CSS_ONLINE))
> >                 memcg = parent_mem_cgroup(memcg);
> >         if (memcg)
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 8a574d9361c1..a1633ea15fbf 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
> >         return s->memcg_params.root_cache;
> >  }
> >
> > +static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
> > +{
> > +       /*
> > +        * page->mem_cgroup and page->obj_cgroups are sharing the same
> > +        * space. To distinguish between them in case we don't know for sure
> > +        * that the page is a slab page (e.g. page_cgroup_ino()), let's
> > +        * always set the lowest bit of obj_cgroups.
> > +        */
> > +       return (struct obj_cgroup **)
> > +               ((unsigned long)page->obj_cgroups & ~0x1UL);
> > +}
> > +
> >  /*
> >   * Expects a pointer to a slab page. Please note, that PageSlab() check
> >   * isn't sufficient, as it returns true also for tail compound slab pages,
> > @@ -406,6 +418,26 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >         percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
> >  }
> >
> > +static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> > +                                              struct kmem_cache *s, gfp_t gfp)
> > +{
> > +       unsigned int objects = objs_per_slab_page(s, page);
> > +       void *vec;
> > +
> > +       vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
> 
> Should the above allocation be on the same node as the page?

Yeah, it's a clever idea. The following patch should do the trick.
Andrew, can you, please, squash this in?

Thank you!


diff --git a/mm/slab.h b/mm/slab.h
index 0a31600a0f5c..2a036eefbd7e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -233,7 +233,8 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
        unsigned int objects = objs_per_slab_page(s, page);
        void *vec;
 
-       vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
+       vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
+                          page_to_nid(page));
        if (!vec)
                return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-06-20  0:25     ` Roman Gushchin
@ 2020-06-20  0:31       ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-20  0:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 5:25 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Jun 19, 2020 at 09:36:16AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Allocate and release memory to store obj_cgroup pointers for each
> > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > to the allocated space.
> > >
> > > To distinguish between obj_cgroups and memcg pointers in case
> > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > let's always set the lowest bit in the obj_cgroup case.
> > >
> >
> > I think the commit message should talk about the potential overhead
> > (i.e an extra pointer for each object) along with the justifications
> > (i.e. less internal fragmentation and potentially more savings than
> > the overhead).
>
> How about adding the following chunk? I don't like forward links in
> commit messages, so maybe putting it into the cover letter?
>
> This commit temporarily increases the memory footprint of the kernel memory
> accounting. To store obj_cgroup pointers we'll need a place for an
> objcg_pointer for each allocated object. However, the following patches
> in the series will enable sharing of slab pages between memory cgroups,
> which will dramatically increase the total slab utilization. And the final
> memory footprint will be significantly smaller than before.
>

This looks good to me.

> >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/mm_types.h |  5 +++-
> > >  include/linux/slab_def.h |  6 +++++
> > >  include/linux/slub_def.h |  5 ++++
> > >  mm/memcontrol.c          | 17 +++++++++++---
> > >  mm/slab.h                | 49 ++++++++++++++++++++++++++++++++++++++++
> > >  5 files changed, 78 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 64ede5f150dc..0277fbab7c93 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -198,7 +198,10 @@ struct page {
> > >         atomic_t _refcount;
> > >
> > >  #ifdef CONFIG_MEMCG
> > > -       struct mem_cgroup *mem_cgroup;
> > > +       union {
> > > +               struct mem_cgroup *mem_cgroup;
> > > +               struct obj_cgroup **obj_cgroups;
> > > +       };
> > >  #endif
> > >
> > >         /*
> > > diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> > > index abc7de77b988..ccda7b9669a5 100644
> > > --- a/include/linux/slab_def.h
> > > +++ b/include/linux/slab_def.h
> > > @@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> > >         return reciprocal_divide(offset, cache->reciprocal_buffer_size);
> > >  }
> > >
> > > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > > +                                    const struct page *page)
> > > +{
> > > +       return cache->num;
> > > +}
> > > +
> > >  #endif /* _LINUX_SLAB_DEF_H */
> > > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > > index 30e91c83d401..f87302dcfe8c 100644
> > > --- a/include/linux/slub_def.h
> > > +++ b/include/linux/slub_def.h
> > > @@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> > >         return __obj_to_index(cache, page_address(page), obj);
> > >  }
> > >
> > > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > > +                                    const struct page *page)
> > > +{
> > > +       return page->objects;
> > > +}
> > >  #endif /* _LINUX_SLUB_DEF_H */
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 7ff66275966c..2020c7542aa1 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
> > >         unsigned long ino = 0;
> > >
> > >         rcu_read_lock();
> > > -       if (PageSlab(page) && !PageTail(page))
> > > +       if (PageSlab(page) && !PageTail(page)) {
> > >                 memcg = memcg_from_slab_page(page);
> > > -       else
> > > -               memcg = READ_ONCE(page->mem_cgroup);
> > > +       } else {
> > > +               memcg = page->mem_cgroup;
> > > +
> > > +               /*
> > > +                * The lowest bit set means that memcg isn't a valid
> > > +                * memcg pointer, but a obj_cgroups pointer.
> > > +                * In this case the page is shared and doesn't belong
> > > +                * to any specific memory cgroup.
> > > +                */
> > > +               if ((unsigned long) memcg & 0x1UL)
> > > +                       memcg = NULL;
> > > +       }
> > > +
> > >         while (memcg && !(memcg->css.flags & CSS_ONLINE))
> > >                 memcg = parent_mem_cgroup(memcg);
> > >         if (memcg)
> > > diff --git a/mm/slab.h b/mm/slab.h
> > > index 8a574d9361c1..a1633ea15fbf 100644
> > > --- a/mm/slab.h
> > > +++ b/mm/slab.h
> > > @@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
> > >         return s->memcg_params.root_cache;
> > >  }
> > >
> > > +static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
> > > +{
> > > +       /*
> > > +        * page->mem_cgroup and page->obj_cgroups are sharing the same
> > > +        * space. To distinguish between them in case we don't know for sure
> > > +        * that the page is a slab page (e.g. page_cgroup_ino()), let's
> > > +        * always set the lowest bit of obj_cgroups.
> > > +        */
> > > +       return (struct obj_cgroup **)
> > > +               ((unsigned long)page->obj_cgroups & ~0x1UL);
> > > +}
> > > +
> > >  /*
> > >   * Expects a pointer to a slab page. Please note, that PageSlab() check
> > >   * isn't sufficient, as it returns true also for tail compound slab pages,
> > > @@ -406,6 +418,26 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> > >         percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
> > >  }
> > >
> > > +static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> > > +                                              struct kmem_cache *s, gfp_t gfp)
> > > +{
> > > +       unsigned int objects = objs_per_slab_page(s, page);
> > > +       void *vec;
> > > +
> > > +       vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
> >
> > Should the above allocation be on the same node as the page?
>
> Yeah, it's a clever idea. The following patch should do the trick.
> Andrew, can you, please, squash this in?
>
> Thank you!
>
>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> diff --git a/mm/slab.h b/mm/slab.h
> index 0a31600a0f5c..2a036eefbd7e 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -233,7 +233,8 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
>         unsigned int objects = objs_per_slab_page(s, page);
>         void *vec;
>
> -       vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
> +       vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp,
> +                          page_to_nid(page));
>         if (!vec)
>                 return -ENOMEM;
>

* Re: [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-06-08 23:06 ` [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2020-06-20  0:54   ` Shakeel Butt
  2020-06-20  1:29     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-20  0:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Switch to per-object accounting of non-root slab objects.
>
> Charging is performed using obj_cgroup API in the pre_alloc hook.
> Obj_cgroup is charged with the size of the object and the size
> of metadata: for now it's the size of an obj_cgroup pointer.
> If the amount of memory has been charged successfully, the actual
> allocation code is executed. Otherwise, -ENOMEM is returned.
>
> In the post_alloc hook if the actual allocation succeeded,
> corresponding vmstats are bumped and the obj_cgroup pointer is saved.
> Otherwise, the charge is canceled.
>
> On the free path the obj_cgroup pointer is obtained and used to uncharge
> the size of the object being released.
>
> Memcg and lruvec counters are now representing only memory used
> by active slab objects and do not include the free space. The free
> space is shared and doesn't belong to any specific cgroup.
>
> Global per-node slab vmstats are still modified from (un)charge_slab_page()
> functions. The idea is to keep all slab pages accounted as slab pages
> on system level.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
[snip]
> +
> +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> +                                               struct obj_cgroup **objcgp,
> +                                               size_t objects, gfp_t flags)
> +{
> +       struct kmem_cache *cachep;
> +
> +       cachep = memcg_kmem_get_cache(s, objcgp);
> +       if (is_root_cache(cachep))
> +               return s;
> +
> +       if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
> +               memcg_kmem_put_cache(cachep);

obj_cgroup_put(*objcgp)? Or better, do that in memcg_kmem_put_cache().

> +               cachep = NULL;
> +       }
> +
> +       return cachep;
> +}
> +

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-17 14:31           ` Mel Gorman
@ 2020-06-20  0:57             ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-20  0:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, Shakeel Butt, Andrew Morton, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML,
	Jesper Dangaard Brouer

On Wed, Jun 17, 2020 at 03:31:10PM +0100, Mel Gorman wrote:
> On Wed, Jun 17, 2020 at 01:24:21PM +0200, Vlastimil Babka wrote:
> > > Not really.
> > > 
> > > Sharing a single set of caches adds some overhead to root- and non-accounted
> > > allocations, which is something I've tried hard to avoid in my original version.
> > > But I have to admit, it allows us to simplify and remove a lot of code, and here
> > > it's hard to argue with Johannes, who pushed for this design.
> > > 
> > > With performance testing it's not that easy, because it's not obvious what
> > > we wanna test. Obviously, per-object accounting is more expensive, and
> > > measuring something like 1000000 allocations and deallocations in a row from
> > > a single kmem_cache will show a regression. But in the real world the relative
> > > cost of allocations is usually low, and we can get some benefits from a smaller
> > > working set and from having shared kmem_cache objects cache hot.
> > > Not to mention some extra memory and the reduced fragmentation.
> > > 
> > > We've done extensive testing of the original version in Facebook production,
> > > and we haven't noticed any regressions so far. But I have to admit, we were
> > > using the original version with two sets of kmem_caches.
> > > 
> > > If you have any specific tests in mind, I can definitely run them. Or if you
> > > can help with the performance evaluation, I'll appreciate it a lot.
> > 
> > Jesper provided some pointers here [1], it would be really great if you could
> > run at least those microbenchmarks. With mmtests it's the major question of
> > which subset/profiles to run, maybe the referenced commits provide some hints,
> > or maybe Mel could suggest what he used to evaluate SLAB vs SLUB not so long ago.
> > 
> 
> Last time, the list of mmtests configurations I used for a basic
> comparison was
> 
> db-pgbench-timed-ro-small-ext4
> db-pgbench-timed-ro-small-xfs
> io-dbench4-async-ext4
> io-dbench4-async-xfs
> io-bonnie-dir-async-ext4
> io-bonnie-dir-async-xfs
> io-bonnie-file-async-ext4
> io-bonnie-file-async-xfs
> io-fsmark-xfsrepair-xfs
> io-metadata-xfs
> network-netperf-unbound
> network-netperf-cross-node
> network-netperf-cross-socket
> network-sockperf-unbound
> network-netperf-unix-unbound
> network-netpipe
> network-tbench
> pagereclaim-shrinker-ext4
> scheduler-unbound
> scheduler-forkintensive
> workload-kerndevel-xfs
> workload-thpscale-madvhugepage-xfs
> workload-thpscale-xfs
> 
> Some were more valid than others in terms of doing an evaluation. I
> followed up later with a more comprehensive comparison but that was
> overkill.
> 
> Each time I did a slab/slub comparison in the past, I had to reverify
> the rate that kmem_cache_* functions were actually being called as the
> pattern can change over time even for the same workload.  A comparison
> gets more complicated when comparing cgroups as ideally there would be
> workloads running in multiple groups, but that gets complex and I think
> it's reasonable to just test the "basic" case without cgroups.

Thank you Mel for the suggestion!

I'll try to come up with some numbers soon. I guess networking tests
will be most interesting in this case.

Thanks!

Roman

* Re: [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-06-20  0:16   ` Shakeel Butt
@ 2020-06-20  1:19     ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-20  1:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 05:16:02PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Store the obj_cgroup pointer in the corresponding place of
> > page->obj_cgroups for each allocated non-root slab object.
> > Make sure that each allocated object holds a reference to obj_cgroup.
> >
> > Objcg pointer is obtained from the memcg->objcg dereferencing
> > in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> > post_alloc_hook. Then in case of successful allocation(s) it's
> > getting stored in the page->obj_cgroups vector.
> >
> > The objcg obtaining part looks a bit bulky now, but it will be simplified
> > by the next commits in the series.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> One nit below otherwise:
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> 
> > ---
> [snip]
> > +static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > +                                             struct obj_cgroup *objcg,
> > +                                             size_t size, void **p)
> > +{
> > +       struct page *page;
> > +       unsigned long off;
> > +       size_t i;
> > +
> > +       for (i = 0; i < size; i++) {
> > +               if (likely(p[i])) {
> > +                       page = virt_to_head_page(p[i]);
> > +                       off = obj_to_index(s, page, p[i]);
> > +                       obj_cgroup_get(objcg);
> > +                       page_obj_cgroups(page)[off] = objcg;
> > +               }
> > +       }
> > +       obj_cgroup_put(objcg);
> 
> Nit: we get the objcg reference in memcg_kmem_get_cache(), so wouldn't it
> look cleaner to put that reference in memcg_kmem_put_cache() instead
> of here?

memcg_kmem_put_cache() will go away completely later in the series.

I know the code might look sub-optimal and messy at some stages,
but that's only because there is a big transition from the original
to the final state, and I don't want to increase intermediate diffs.

Please, take a look at the final result.

* Re: [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  2020-06-08 23:06 ` [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2020-06-20  1:19   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-20  1:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> To make the memcg_kmem_bypass() function available outside of
> the memcontrol.c, let's move it to memcontrol.h. The function
> is small and nicely fits into static inline sort of functions.
>
> It will be used from the slab code.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

* Re: [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages
  2020-06-20  0:54   ` Shakeel Butt
@ 2020-06-20  1:29     ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-20  1:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, Jun 19, 2020 at 05:54:24PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Switch to per-object accounting of non-root slab objects.
> >
> > Charging is performed using obj_cgroup API in the pre_alloc hook.
> > Obj_cgroup is charged with the size of the object and the size
> > of metadata: for now it's the size of an obj_cgroup pointer.
> > If the amount of memory has been charged successfully, the actual
> > allocation code is executed. Otherwise, -ENOMEM is returned.
> >
> > In the post_alloc hook if the actual allocation succeeded,
> > corresponding vmstats are bumped and the obj_cgroup pointer is saved.
> > Otherwise, the charge is canceled.
> >
> > On the free path the obj_cgroup pointer is obtained and used to uncharge
> > the size of the object being released.
> >
> > Memcg and lruvec counters are now representing only memory used
> > by active slab objects and do not include the free space. The free
> > space is shared and doesn't belong to any specific cgroup.
> >
> > Global per-node slab vmstats are still modified from (un)charge_slab_page()
> > functions. The idea is to keep all slab pages accounted as slab pages
> > on system level.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> [snip]
> > +
> > +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> > +                                               struct obj_cgroup **objcgp,
> > +                                               size_t objects, gfp_t flags)
> > +{
> > +       struct kmem_cache *cachep;
> > +
> > +       cachep = memcg_kmem_get_cache(s, objcgp);
> > +       if (is_root_cache(cachep))
> > +               return s;
> > +
> > +       if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
> > +               memcg_kmem_put_cache(cachep);
> 
> obj_cgroup_put(*objcgp)? Or better, do that in memcg_kmem_put_cache().

Hm, you are right, it's a real issue. It's fixed later in the series by
"mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"
though. Good catch!

I'll re-spin the necessary patches to avoid it and ask Andrew to replace them.

Thank you!

* Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
  2020-06-19 21:38     ` Roman Gushchin
  2020-06-19 22:16       ` Shakeel Butt
@ 2020-06-20 22:50       ` Andrew Morton
  1 sibling, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2020-06-20 22:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Fri, 19 Jun 2020 14:38:10 -0700 Roman Gushchin <guro@fb.com> wrote:

> Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
> Thanks!

Sure.  But a changelog, a signoff and an avoidance of tabs-replaced-by-spaces
would still be preferred, please!

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
                   ` (19 preceding siblings ...)
  2020-06-18  9:27 ` Mike Rapoport
@ 2020-06-21 22:57 ` Qian Cai
  2020-06-21 23:34   ` Roman Gushchin
  20 siblings, 1 reply; 92+ messages in thread
From: Qian Cai @ 2020-06-21 22:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel, catalin.marinas

On Mon, Jun 08, 2020 at 04:06:35PM -0700, Roman Gushchin wrote:
> This is v6 of the slab cgroup controller rework.
> 
> The patchset moves the accounting from the page level to the object
> level. It allows sharing slab pages between memory cgroups.
> This leads to a significant win in the slab utilization (up to 45%)
> and the corresponding drop in the total kernel memory footprint.
> The reduced number of unmovable slab pages should also have a positive
> effect on the memory fragmentation.
> 
> The patchset makes the slab accounting code simpler: there is no longer
> any need for the complicated dynamic creation and destruction of per-cgroup
> slab caches; all memory cgroups use a global set of shared slab caches.
> The lifetime of slab caches is no longer connected to the lifetime
> of memory cgroups.
> 
> The more precise accounting does require more CPU, however in practice
> the difference seems to be negligible. We've been using the new slab
> controller in Facebook production for several months with different
> workloads and haven't seen any noticeable regressions. What we've seen
> were memory savings on the order of 1 GB per host (it varied heavily depending
> on the actual workload, size of RAM, number of CPUs, memory pressure, etc).
> 
> The third version of the patchset added yet another step towards
> the simplification of the code: sharing of slab caches between
> accounted and non-accounted allocations. It comes with significant
> upsides (most noticeably, the complete elimination of dynamic slab cache
> creation) but not without some regression risks, so this change sits
> on top of the patchset and is not completely merged in. So in the unlikely
> event of a noticeable performance regression it can be reverted separately.

Reverting this series and its dependency [1], i.e.,

git revert --no-edit 05923a2ccacd..07666ee77fb4

on top of next-20200621 fixed an issue where kmemleak could report
thousands of leaks like the ones below, using this .config (if it matters):

https://github.com/cailca/linux-mm/blob/master/x86.config

unreferenced object 0xffff888ff2bf6200 (size 512):
  comm "systemd-udevd", pid 794, jiffies 4294940381 (age 602.740s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000e8e9272e>] __kmalloc_node+0x149/0x260
    [<0000000021c1b4a2>] slab_post_alloc_hook+0x172/0x4d0
    [<00000000f38fad30>] kmem_cache_alloc_node+0x110/0x2e0
    [<00000000fdf1d747>] __alloc_skb+0x92/0x520
    [<00000000bda2c48f>] alloc_skb_with_frags+0x72/0x530
    [<0000000023d10084>] sock_alloc_send_pskb+0x5a1/0x720
    [<00000000dd3334cc>] unix_dgram_sendmsg+0x32a/0xe70
    [<000000001ad988ff>] sock_write_iter+0x341/0x420
    [<0000000056d15d07>] new_sync_write+0x4b6/0x610
    [<0000000090b14475>] vfs_write+0x18b/0x4d0
    [<000000009e7ba1b4>] ksys_write+0x180/0x1c0
    [<00000000713e3b98>] do_syscall_64+0x5f/0x310
    [<00000000b1c204e0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff8888040fac00 (size 512):
  comm "systemd-udevd", pid 1096, jiffies 4294941658 (age 590.010s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000e8e9272e>] __kmalloc_node+0x149/0x260
    [<0000000021c1b4a2>] slab_post_alloc_hook+0x172/0x4d0
    [<00000000e26f0785>] kmem_cache_alloc+0xe5/0x2a0
    [<00000000ac28147d>] __alloc_file+0x22/0x2a0
    [<0000000031c82651>] alloc_empty_file+0x3e/0x100
    [<000000000e337bda>] path_openat+0x10c/0x1b00
    [<000000008969cf2d>] do_filp_open+0x171/0x240
    [<000000009462ef7b>] do_sys_openat2+0x2db/0x500
    [<0000000007340ff0>] do_sys_open+0x85/0xd0
    [<00000000713e3b98>] do_syscall_64+0x5f/0x310
    [<00000000b1c204e0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff8885bf0c8200 (size 512):
  comm "systemd-udevd", pid 1075, jiffies 4294941661 (age 589.980s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000e8e9272e>] __kmalloc_node+0x149/0x260
    [<0000000021c1b4a2>] slab_post_alloc_hook+0x172/0x4d0
    [<00000000e26f0785>] kmem_cache_alloc+0xe5/0x2a0
    [<00000000ac28147d>] __alloc_file+0x22/0x2a0
    [<0000000031c82651>] alloc_empty_file+0x3e/0x100
    [<000000000e337bda>] path_openat+0x10c/0x1b00
    [<000000008969cf2d>] do_filp_open+0x171/0x240
    [<000000009462ef7b>] do_sys_openat2+0x2db/0x500
    [<0000000007340ff0>] do_sys_open+0x85/0xd0
    [<00000000713e3b98>] do_syscall_64+0x5f/0x310
    [<00000000b1c204e0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff88903ce24a00 (size 512):
  comm "systemd-udevd", pid 1078, jiffies 4294941806 (age 588.540s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000e8e9272e>] __kmalloc_node+0x149/0x260
    [<0000000021c1b4a2>] slab_post_alloc_hook+0x172/0x4d0
    [<00000000e26f0785>] kmem_cache_alloc+0xe5/0x2a0
    [<0000000002285574>] vm_area_dup+0x71/0x2a0
    [<000000006c732816>] dup_mm+0x548/0xfc0
    [<000000001cf5c685>] copy_process+0x2a33/0x62c0
    [<000000006e2a8069>] _do_fork+0xf8/0xce0
    [<00000000e7ba268e>] __do_sys_clone+0xda/0x120
    [<00000000713e3b98>] do_syscall_64+0x5f/0x310
    [<00000000b1c204e0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[1] https://lore.kernel.org/linux-mm/20200608230819.832349-1-guro@fb.com/
Also, I confirmed that reverting only the "mm: memcg accounting of percpu
memory" series did not help.

> 
> v6:
>   1) rebased on top of the mm tree
>   2) removed a redundant check from cache_from_obj(), suggested by Vlastimil
> 
> v5:
>   1) fixed a build error, spotted by Vlastimil
>   2) added a comment about memcg->nr_charged_bytes, asked by Johannes
>   3) added missed acks and reviews
> 
> v4:
>   1) rebased on top of the mm tree, some fixes here and there
>   2) merged obj_to_index() with slab_index(), suggested by Vlastimil
>   3) changed objects_per_slab() to a better objects_per_slab_page(),
>      suggested by Vlastimil
>   4) other minor fixes and changes
> 
> v3:
>   1) added a patch that switches to a global single set of kmem_caches
>   2) kmem API clean up dropped, because it has already been merged
>   3) byte-sized slab vmstat API over page-sized global counters and
>      bytes-sized memcg/lruvec counters
>   4) obj_cgroup refcounting simplifications and other minor fixes
>   5) other minor changes
> 
> v2:
>   1) implemented re-layering and renaming suggested by Johannes,
>      added his patch to the set. Thanks!
>   2) fixed the issue discovered by Bharata B Rao. Thanks!
>   3) added kmem API clean up part
>   4) added slab/memcg follow-up clean up part
>   5) fixed a couple of issues discovered by internal testing on FB fleet.
>   6) added kselftests
>   7) included metadata into the charge calculation
>   8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc
> 
> v1:
>   1) fixed a bug in zoneinfo_show_print()
>   2) added some comments to the subpage charging API, a minor fix
>   3) separated memory.kmem.slabinfo deprecation into a separate patch,
>      provided a drgn-based replacement
>   4) rebased on top of the current mm tree
> 
> RFC:
>   https://lwn.net/Articles/798605/
> 
> 
> Johannes Weiner (1):
>   mm: memcontrol: decouple reference counting from page accounting
> 
> Roman Gushchin (18):
>   mm: memcg: factor out memcg- and lruvec-level changes out of
>     __mod_lruvec_state()
>   mm: memcg: prepare for byte-sized vmstat items
>   mm: memcg: convert vmstat slab counters to bytes
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcg/slab: obj_cgroup API
>   mm: memcg/slab: allocate obj_cgroups for non-root slab pages
>   mm: memcg/slab: save obj_cgroup for non-root slab objects
>   mm: memcg/slab: charge individual slab objects instead of pages
>   mm: memcg/slab: deprecate memory.kmem.slabinfo
>   mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
>   mm: memcg/slab: use a single set of kmem_caches for all accounted
>     allocations
>   mm: memcg/slab: simplify memcg cache creation
>   mm: memcg/slab: remove memcg_kmem_get_cache()
>   mm: memcg/slab: deprecate slab_root_caches
>   mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
>   mm: memcg/slab: use a single set of kmem_caches for all allocations
>   kselftests: cgroup: add kernel memory accounting tests
>   tools/cgroup: add memcg_slabinfo.py tool
> 
>  drivers/base/node.c                        |   6 +-
>  fs/proc/meminfo.c                          |   4 +-
>  include/linux/memcontrol.h                 |  85 ++-
>  include/linux/mm_types.h                   |   5 +-
>  include/linux/mmzone.h                     |  24 +-
>  include/linux/slab.h                       |   5 -
>  include/linux/slab_def.h                   |   9 +-
>  include/linux/slub_def.h                   |  31 +-
>  include/linux/vmstat.h                     |  14 +-
>  kernel/power/snapshot.c                    |   2 +-
>  mm/memcontrol.c                            | 608 +++++++++++--------
>  mm/oom_kill.c                              |   2 +-
>  mm/page_alloc.c                            |   8 +-
>  mm/slab.c                                  |  70 +--
>  mm/slab.h                                  | 372 +++++-------
>  mm/slab_common.c                           | 643 +--------------------
>  mm/slob.c                                  |  12 +-
>  mm/slub.c                                  | 229 +-------
>  mm/vmscan.c                                |   3 +-
>  mm/vmstat.c                                |  30 +-
>  mm/workingset.c                            |   6 +-
>  tools/cgroup/memcg_slabinfo.py             | 226 ++++++++
>  tools/testing/selftests/cgroup/.gitignore  |   1 +
>  tools/testing/selftests/cgroup/Makefile    |   2 +
>  tools/testing/selftests/cgroup/test_kmem.c | 382 ++++++++++++
>  25 files changed, 1374 insertions(+), 1405 deletions(-)
>  create mode 100755 tools/cgroup/memcg_slabinfo.py
>  create mode 100644 tools/testing/selftests/cgroup/test_kmem.c
> 
> -- 
> 2.25.4
> 
> 

* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-21 22:57 ` Qian Cai
@ 2020-06-21 23:34   ` Roman Gushchin
  2020-06-21 23:53     ` Qian Cai
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-21 23:34 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel, catalin.marinas

On Sun, Jun 21, 2020 at 06:57:52PM -0400, Qian Cai wrote:
> On Mon, Jun 08, 2020 at 04:06:35PM -0700, Roman Gushchin wrote:
> > This is v6 of the slab cgroup controller rework.
> > 
> > The patchset moves the accounting from the page level to the object
> > level. It allows sharing slab pages between memory cgroups.
> > This leads to a significant win in the slab utilization (up to 45%)
> > and the corresponding drop in the total kernel memory footprint.
> > The reduced number of unmovable slab pages should also have a positive
> > effect on the memory fragmentation.
> > 
> > The patchset makes the slab accounting code simpler: there is no longer
> > any need for the complicated dynamic creation and destruction of per-cgroup
> > slab caches; all memory cgroups use a global set of shared slab caches.
> > The lifetime of slab caches is no longer connected to the lifetime
> > of memory cgroups.
> > 
> > The more precise accounting does require more CPU, however in practice
> > the difference seems to be negligible. We've been using the new slab
> > controller in Facebook production for several months with different
> > workloads and haven't seen any noticeable regressions. What we've seen
> > were memory savings on the order of 1 GB per host (it varied heavily depending
> > on the actual workload, size of RAM, number of CPUs, memory pressure, etc).
> > 
> > The third version of the patchset added yet another step towards
> > the simplification of the code: sharing of slab caches between
> > accounted and non-accounted allocations. It comes with significant
> > upsides (most noticeably, the complete elimination of dynamic slab cache
> > creation) but not without some regression risks, so this change sits
> > on top of the patchset and is not completely merged in. So in the unlikely
> > event of a noticeable performance regression it can be reverted separately.
> 
> Reverting this series and its dependency [1], i.e.,
> 
> git revert --no-edit 05923a2ccacd..07666ee77fb4
> 
> on top of next-20200621 fixed an issue where kmemleak could report
> thousands of leaks like the ones below, using this .config (if it matters):
> 
> https://github.com/cailca/linux-mm/blob/master/x86.config
> 
> unreferenced object 0xffff888ff2bf6200 (size 512):

Hi Qian!

My wild guess is that kmemleak is getting confused by modifying the lowest
bit of the page->mem_cgroup/obj_cgroups pointer:

struct page {
	...
	union {
		struct mem_cgroup *mem_cgroup;
		struct obj_cgroup **obj_cgroups;
	};
	...
}

We're using the lowest bit to distinguish between a "normal" mem_cgroup
pointer and a vector of obj_cgroup pointers.

This pointer to obj_cgroup vector is saved only here, so if we're modifying
the address, I guess it's what makes kmemleak think that there is a leak.

Or do you have a real leak?

Thanks!


* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-21 23:34   ` Roman Gushchin
@ 2020-06-21 23:53     ` Qian Cai
  2020-06-22  3:07       ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Qian Cai @ 2020-06-21 23:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel, catalin.marinas



> On Jun 21, 2020, at 7:34 PM, Roman Gushchin <guro@fb.com> wrote:
> 
> My wild guess is that kmemleak is getting confused by modifying the lowest
> bit of page->mem_cgroup/obj_cgroups pointer:
> 
> struct page {
>    ...
>    union {
>        struct mem_cgroup *mem_cgroup;
>        struct obj_cgroup **obj_cgroups;
>    };
>    ...
> }
> 
> We're using the lowest bit to distinguish between a "normal" mem_cgroup
> pointer and a vector of obj_cgroup pointers.
> 
> This pointer to obj_cgroup vector is saved only here, so if we're modifying
> the address, I guess it's what makes kmemleak think that there is a leak.
> 
> Or do you have a real leak?

The point is that we can’t have the patchset in its current form if it renders kmemleak totally useless with so many reports, even if they are false positives.

Anyway, this is rather easy to reproduce: I am able to reproduce it on multiple bare-metal machines just by booting the kernel.

# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak


* Re: [PATCH v6 00/19] The new cgroup slab memory controller
  2020-06-21 23:53     ` Qian Cai
@ 2020-06-22  3:07       ` Roman Gushchin
  0 siblings, 0 replies; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22  3:07 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Shakeel Butt, linux-mm, Vlastimil Babka, kernel-team,
	linux-kernel, catalin.marinas

On Sun, Jun 21, 2020 at 07:53:23PM -0400, Qian Cai wrote:
> 
> 
> > On Jun 21, 2020, at 7:34 PM, Roman Gushchin <guro@fb.com> wrote:
> > 
> > My wild guess is that kmemleak is getting confused by modifying the lowest
> > bit of page->mem_cgroup/obj_cgroups pointer:
> > 
> > struct page {
> >    ...
> >    union {
> >        struct mem_cgroup *mem_cgroup;
> >        struct obj_cgroup **obj_cgroups;
> >    };
> >    ...
> > }
> > 
> > We're using the lowest bit to distinguish between a "normal" mem_cgroup
> > pointer and a vector of obj_cgroup pointers.
> > 
> > This pointer to obj_cgroup vector is saved only here, so if we're modifying
> > the address, I guess it's what makes kmemleak think that there is a leak.
> > 
> > Or do you have a real leak?
> 
> The point is that we can’t have the patchset in its current form if it renders kmemleak totally useless with so many reports, even if they are false positives.
> 
> Anyway, this is rather easy to reproduce: I am able to reproduce it on multiple bare-metal machines just by booting the kernel.
> 
> # echo scan > /sys/kernel/debug/kmemleak
> # cat /sys/kernel/debug/kmemleak

Ok, thank you for the report, I'll take care of it.

It's easy to mark these vectors to be ignored by kmemleak, but I guess it's better
to explicitly add an additional reference, so we can track actual leaks.

I'll send a patch with a fix soon-ish.

Thanks!


* Re: [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
  2020-06-08 23:06 ` [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
@ 2020-06-22 16:56   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 16:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> This is a fairly big but mostly red patch, which makes all accounted
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
>
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.
>
> This patch performs the following changes:
> 1) introduces memcg_params.memcg_cache pointer to represent the
>    kmem_cache which will be used for all non-root allocations
> 2) reuses the existing memcg kmem_cache creation mechanism
>    to create memcg kmem_cache on the first allocation attempt
> 3) memcg kmem_caches are named <kmemcache_name>-memcg,
>    e.g. dentry-memcg
> 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache
>    or schedule its creation and return the root cache
> 5) removes almost all non-root kmem_cache management code
>    (separate refcounter, reparenting, shrinking, etc)
> 6) makes slab debugfs display root_mem_cgroup css id and never
>    show :dead and :deact flags in the memcg_slabinfo attribute.
>
> Following patches in the series will simplify the kmem_cache creation.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

This is a very satisfying patch.

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-08 23:06 ` [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2020-06-22 17:12   ` Shakeel Butt
  2020-06-22 18:01     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 17:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Deprecate memory.kmem.slabinfo.
>
> An empty file will be presented if corresponding config options are
> enabled.
>
> The interface is implementation dependent, isn't present in cgroup v2,
> and is generally useful only for core mm debugging purposes. In other
> words, it doesn't provide any value for the absolute majority of users.
>
> A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> and also allows to get any additional information without a need
> to recompile the kernel.
>
> If a drgn-based solution is too slow for a task, a bpf-based tracing
> tool can be used, which can easily keep track of all slab allocations
> belonging to a memory cgroup.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Hi Roman,

I am not against removing the memory.kmem.slabinfo interface but I
would like to have an alternative solution more accessible than
tools/cgroup/slabinfo.py.

In our case, we don't have ssh access and if we need something for
debugging, it is much more preferable to provide a file to read to
SREs. After the review, that file will be added to a whitelist and
then we can directly read that file through automated tools without
approval for each request.

I am just wondering if a file interface can be provided for whatever
tools/cgroup/slabinfo.py is providing.

Shakeel


* Re: [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-06-08 23:06 ` [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
@ 2020-06-22 17:29   ` Shakeel Butt
  2020-06-22 17:40     ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 17:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Because the number of non-root kmem_caches doesn't depend on the
> number of memory cgroups anymore and is generally not very big,
> there is no more need for a dedicated workqueue.
>
> Also, as there is no more need to pass any arguments to the
> memcg_create_kmem_cache() except the root kmem_cache, it's
> possible to just embed the work structure into the kmem_cache
> and avoid the dynamic allocation of the work structure.
>
> This will also simplify the synchronization: for each root kmem_cache
> there is only one work. So there will be no more concurrent attempts
> to create a non-root kmem_cache for a root kmem_cache: the second and
> all following attempts to queue the work will fail.
>
>
> On the kmem_cache destruction path there is no more need to call the
> expensive flush_workqueue() and wait for all pending works to be
> finished. Instead, cancel_work_sync() can be used to cancel/wait for
> only one work.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Why not pre-allocate the non-root kmem_cache at the kmem_cache
creation time? No need for work_struct, queue_work() or
cancel_work_sync() at all.


* Re: [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  2020-06-08 23:06 ` [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2020-06-22 17:32   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 17:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> memcg_accumulate_slabinfo() is never called with a non-root
> kmem_cache as a first argument, so the is_root_cache(s) check
> is redundant and can be removed without any functional change.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches
  2020-06-08 23:06 ` [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
@ 2020-06-22 17:36   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 17:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Currently there are two lists of kmem_caches:
> 1) slab_caches, which contains all kmem_caches,
> 2) slab_root_caches, which contains only root kmem_caches.
>
> And there is some preprocessor magic to have a single list
> if CONFIG_MEMCG_KMEM isn't enabled.
>
> It was required earlier because the number of non-root kmem_caches
> was proportional to the number of memory cgroups and could reach
> really big values. Now, when it cannot exceed the number of root
> kmem_caches, there is really no reason to maintain two lists.
>
> We never iterate over the slab_root_caches list on any hot paths,
> so it's perfectly fine to iterate over slab_caches and filter out
> non-root kmem_caches.
>
> It allows to remove a lot of config-dependent code and two pointers
> from the kmem_cache structure.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-06-22 17:29   ` Shakeel Butt
@ 2020-06-22 17:40     ` Roman Gushchin
  2020-06-22 18:03       ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 17:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 10:29:29AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Because the number of non-root kmem_caches doesn't depend on the
> > number of memory cgroups anymore and is generally not very big,
> > there is no more need for a dedicated workqueue.
> >
> > Also, as there is no more need to pass any arguments to the
> > memcg_create_kmem_cache() except the root kmem_cache, it's
> > possible to just embed the work structure into the kmem_cache
> > and avoid the dynamic allocation of the work structure.
> >
> > This will also simplify the synchronization: for each root kmem_cache
> > there is only one work. So there will be no more concurrent attempts
> > to create a non-root kmem_cache for a root kmem_cache: the second and
> > all following attempts to queue the work will fail.
> >
> >
> > On the kmem_cache destruction path there is no more need to call the
> > expensive flush_workqueue() and wait for all pending works to be
> > finished. Instead, cancel_work_sync() can be used to cancel/wait for
> > only one work.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Why not pre-allocate the non-root kmem_cache at the kmem_cache
> creation time? No need for work_struct, queue_work() or
> cancel_work_sync() at all.

Simply because some kmem_caches are created very early, so we don't
even know at that time if we will need memcg slab caches. But this
code is likely going away if we're going with a single set for all
allocations.

Thanks!


* Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-22 17:12   ` Shakeel Butt
@ 2020-06-22 18:01     ` Roman Gushchin
  2020-06-22 18:09       ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 18:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Deprecate memory.kmem.slabinfo.
> >
> > An empty file will be presented if corresponding config options are
> > enabled.
> >
> > The interface is implementation dependent, isn't present in cgroup v2,
> > and is generally useful only for core mm debugging purposes. In other
> > words, it doesn't provide any value for the absolute majority of users.
> >
> > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > and also allows to get any additional information without a need
> > to recompile the kernel.
> >
> > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > tool can be used, which can easily keep track of all slab allocations
> > belonging to a memory cgroup.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi Roman,
> 
> I am not against removing the memory.kmem.slabinfo interface but I
> would like to have an alternative solution more accessible than
> tools/cgroup/slabinfo.py.
> 
> In our case, we don't have ssh access and if we need something for
> debugging, it is much more preferable to provide a file to read to
> SREs. After the review, that file will be added to a whitelist and
> then we can directly read that file through automated tools without
> approval for each request.
> 
> I am just wondering if a file interface can be provided for whatever
> tools/cgroup/slabinfo.py is providing.
> 
> Shakeel

Hello, Shakeel!

I understand your point, but Idk how much we wanna make this code a part
of the kernel and the cgroup interface. The problem is that reading
from it will be really slow in comparison to all other cgroup interface
files. Idk if Google's version of SLAB has a list of all slab pages,
but if not (as in generic SLUB case), it requires scanning of the whole RAM.
So it's not suitable for periodic reading "just in case". But also
the absolute majority of users don't need this information.

If for some reason you're not comfortable with deploying drgn, it's fairly
easy to write a small standalone tool (similar to page-types), which will
do the trick. Maybe it can work for you?

Thanks!


* Re: [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation
  2020-06-22 17:40     ` Roman Gushchin
@ 2020-06-22 18:03       ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 18:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 10:40 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 10:29:29AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Because the number of non-root kmem_caches doesn't depend on the
> > > number of memory cgroups anymore and is generally not very big,
> > > there is no more need for a dedicated workqueue.
> > >
> > > Also, as there is no more need to pass any arguments to the
> > > memcg_create_kmem_cache() except the root kmem_cache, it's
> > > possible to just embed the work structure into the kmem_cache
> > > and avoid the dynamic allocation of the work structure.
> > >
> > > This will also simplify the synchronization: for each root kmem_cache
> > > there is only one work. So there will be no more concurrent attempts
> > > to create a non-root kmem_cache for a root kmem_cache: the second and
> > > all following attempts to queue the work will fail.
> > >
> > >
> > > On the kmem_cache destruction path there is no more need to call the
> > > expensive flush_workqueue() and wait for all pending works to be
> > > finished. Instead, cancel_work_sync() can be used to cancel/wait for
> > > only one work.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Why not pre-allocate the non-root kmem_cache at the kmem_cache
> > creation time? No need for work_struct, queue_work() or
> > cancel_work_sync() at all.
>
> Simply because some kmem_caches are created very early, so we don't
> even know at that time if we will need memcg slab caches. But this
> code is likely going away if we're going with a single set for all
> allocations.
>

LGTM.

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-22 18:01     ` Roman Gushchin
@ 2020-06-22 18:09       ` Shakeel Butt
  2020-06-22 18:25         ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 18:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 11:02 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Deprecate memory.kmem.slabinfo.
> > >
> > > An empty file will be presented if corresponding config options are
> > > enabled.
> > >
> > > The interface is implementation dependent, isn't present in cgroup v2,
> > > and is generally useful only for core mm debugging purposes. In other
> > > words, it doesn't provide any value for the absolute majority of users.
> > >
> > > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > > and also allows to get any additional information without a need
> > > to recompile the kernel.
> > >
> > > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > > tool can be used, which can easily keep track of all slab allocations
> > > belonging to a memory cgroup.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Hi Roman,
> >
> > I am not against removing the memory.kmem.slabinfo interface but I
> > would like to have an alternative solution more accessible than
> > tools/cgroup/slabinfo.py.
> >
> > In our case, we don't have ssh access and if we need something for
> > debugging, it is much more preferable to provide a file to read to
> > SREs. After the review, that file will be added to a whitelist and
> > then we can directly read that file through automated tools without
> > approval for each request.
> >
> > I am just wondering if a file interface can be provided for whatever
> > tools/cgroup/slabinfo.py is providing.
> >
> > Shakeel
>
> Hello, Shakeel!
>
> I understand your point, but Idk how much we wanna make this code a part
> of the kernel and the cgroup interface.

No need for the cgroup interface. I was thinking of a new interface
like /proc/slabinfo_full which tells active objects for each
kmem_cache and memcg pair.

> The problem is that reading
> from it will be really slow in comparison to all other cgroup interface
> files. Idk if Google's version of SLAB has a list of all slab pages,
> but if not (as in generic SLUB case), it requires scanning of the whole RAM.

That's a bummer. Does drgn-based script scan the whole RAM?


* Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-22 18:09       ` Shakeel Butt
@ 2020-06-22 18:25         ` Roman Gushchin
  2020-06-22 18:38           ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 18:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 11:09:47AM -0700, Shakeel Butt wrote:
> On Mon, Jun 22, 2020 at 11:02 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > Deprecate memory.kmem.slabinfo.
> > > >
> > > > An empty file will be presented if corresponding config options are
> > > > enabled.
> > > >
> > > > The interface is implementation dependent, isn't present in cgroup v2,
> > > > and is generally useful only for core mm debugging purposes. In other
> > > > words, it doesn't provide any value for the absolute majority of users.
> > > >
> > > > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > > > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > > > and also allows to get any additional information without a need
> > > > to recompile the kernel.
> > > >
> > > > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > > > tool can be used, which can easily keep track of all slab allocations
> > > > belonging to a memory cgroup.
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > >
> > > Hi Roman,
> > >
> > > I am not against removing the memory.kmem.slabinfo interface but I
> > > would like to have an alternative solution more accessible than
> > > tools/cgroup/slabinfo.py.
> > >
> > > In our case, we don't have ssh access and if we need something for
> > > debugging, it is much more preferable to provide a file to read to
> > > SREs. After the review, that file will be added to a whitelist and
> > > then we can directly read that file through automated tools without
> > > approval for each request.
> > >
> > > I am just wondering if a file interface can be provided for whatever
> > > tools/cgroup/slabinfo.py is providing.
> > >
> > > Shakeel
> >
> > Hello, Shakeel!
> >
> > I understand your point, but Idk how much we wanna make this code a part
> > of the kernel and the cgroup interface.
> 
> No need for the cgroup interface. I was thinking of a new interface
> like /proc/slabinfo_full which tells active objects for each
> kmem_cache and memcg pair.

To me it's a perfect example where tools like drgn and bpf shine.
They are more flexible and do not blow the kernel up with
the debug-only code.

> 
> > The problem is that reading
> > from it will be really slow in comparison to all other cgroup interface
> > files. Idk if Google's version of SLAB has a list of all slab pages,
> > but if not (as in generic SLUB case), it requires scanning of the whole RAM.
> 
> That's a bummer. Does drgn-based script scan the whole RAM?

To be precise, not over all RAM, but over all struct pages.
Unfortunately, there is no better option with SLUB, as there is no
comprehensive list of slab pages available. So the only option is to scan
over all pages with PageSlab flag set.
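
A toy user-space model of such a scan, just to show why the cost grows
with the number of struct pages rather than with the number of slab
pages. struct toy_page, TOY_PG_SLAB and the objects_in_use field are
made up for illustration and do not match the kernel's struct page or
what tools/cgroup/slabinfo.py actually reads:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bit; not the kernel's PG_slab value. */
#define TOY_PG_SLAB (1UL << 0)

/* Toy stand-in for struct page. */
struct toy_page {
	unsigned long flags;
	unsigned int objects_in_use;	/* hypothetical per-page counter */
};

/*
 * Visit every page descriptor and sum objects on slab pages only.
 * Without a list of slab pages the loop is O(nr_pages), no matter how
 * few of the pages are actually slab pages.
 */
static size_t count_slab_objects(const struct toy_page *pages, size_t nr_pages,
				 size_t *nr_slab_pages)
{
	size_t total = 0, slabs = 0;

	assert(pages || nr_pages == 0);
	for (size_t i = 0; i < nr_pages; i++) {
		if (!(pages[i].flags & TOY_PG_SLAB))
			continue;	/* not a slab page, skip it */
		slabs++;
		total += pages[i].objects_in_use;
	}
	if (nr_slab_pages)
		*nr_slab_pages = slabs;
	return total;
}
```

On a host with a lot of RAM the loop above touches every descriptor,
which is why this is not something to read periodically "just in case".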

Thanks!


* Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-06-22 18:25         ` Roman Gushchin
@ 2020-06-22 18:38           ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 18:38 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 11:25 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 11:09:47AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 11:02 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > Deprecate memory.kmem.slabinfo.
> > > > >
> > > > > An empty file will be presented if corresponding config options are
> > > > > enabled.
> > > > >
> > > > > The interface is implementation dependent, isn't present in cgroup v2,
> > > > > and is generally useful only for core mm debugging purposes. In other
> > > > > words, it doesn't provide any value for the absolute majority of users.
> > > > >
> > > > > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > > > > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > > > > and also allows to get any additional information without a need
> > > > > to recompile the kernel.
> > > > >
> > > > > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > > > > tool can be used, which can easily keep track of all slab allocations
> > > > > belonging to a memory cgroup.
> > > > >
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > >
> > > > Hi Roman,
> > > >
> > > > I am not against removing the memory.kmem.slabinfo interface but I
> > > > would like to have an alternative solution more accessible than
> > > > tools/cgroup/slabinfo.py.
> > > >
> > > > In our case, we don't have ssh access and if we need something for
> > > > debugging, it is much more preferable to provide a file to read to
> > > > SREs. After the review, that file will be added to a whitelist and
> > > > then we can directly read that file through automated tools without
> > > > approval for each request.
> > > >
> > > > I am just wondering if a file interface can be provided for whatever
> > > > tools/cgroup/slabinfo.py is providing.
> > > >
> > > > Shakeel
> > >
> > > Hello, Shakeel!
> > >
> > > I understand your point, but Idk how much we wanna make this code a part
> > > of the kernel and the cgroup interface.
> >
> > No need for the cgroup interface. I was thinking of a new interface
> > like /proc/slabinfo_full which tells active objects for each
> > kmem_cache and memcg pair.
>
> To me it's a perfect example where tools like drgn and bpf shine.
> They are more flexible and do not blow the kernel up with
> the debug-only code.
>
> >
> > > The problem is that reading
> > > from it will be really slow in comparison to all other cgroup interface
> > > files. Idk if Google's version of SLAB has a list of all slab pages,
> > > but if not (as in generic SLUB case), it requires scanning of the whole RAM.
> >
> > That's a bummer. Does drgn-based script scan the whole RAM?
>
> To be precise, not over all RAM, but over all struct pages.
> Unfortunately, there is no better option with SLUB, as there is no
> comprehensive list of slab pages available. So the only option is to scan
> over all pages with PageSlab flag set.
>

So, SLUB does not have any field available in the struct page to
support the list of slab pages?

Anyways, that's a separate discussion.

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache()
  2020-06-08 23:06 ` [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache() Roman Gushchin
@ 2020-06-22 18:42   ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 18:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> The memcg_kmem_get_cache() function became really trivial,
> so let's just inline it into the single call point:
> memcg_slab_pre_alloc_hook().
>
> It will make the code less bulky and can also help the compiler
> to generate better code.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
  2020-06-17 23:35   ` Andrew Morton
@ 2020-06-22 19:21   ` Shakeel Butt
  2020-06-22 20:37     ` Roman Gushchin
  1 sibling, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 19:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> Instead of having two sets of kmem_caches: one for system-wide and
> non-accounted allocations and the second one shared by all accounted
> allocations, we can use just one.
>
> The idea is simple: space for obj_cgroup metadata can be allocated
> on demand and filled only for accounted allocations.
>
> It allows to remove a bunch of code which is required to handle
> kmem_cache clones for accounted allocations. There is no more need
> to create them, accumulate statistics, propagate attributes, etc.
> It's a quite significant simplification.
>
> Also, because the total number of slab_caches is almost halved
> (not all kmem_caches have a memcg clone), some additional memory
> savings are expected. On my devvm it additionally saves about 3.5%
> of slab memory.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
[snip]
>  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>                                               struct obj_cgroup *objcg,
> -                                             size_t size, void **p)
> +                                             gfp_t flags, size_t size,
> +                                             void **p)
>  {
>         struct page *page;
>         unsigned long off;
>         size_t i;
>
> +       if (!objcg)
> +               return;
> +
> +       flags &= ~__GFP_ACCOUNT;
>         for (i = 0; i < size; i++) {
>                 if (likely(p[i])) {
>                         page = virt_to_head_page(p[i]);
> +
> +                       if (!page_has_obj_cgroups(page) &&

The page is already linked into the kmem_cache, so don't you need
synchronization for memcg_alloc_page_obj_cgroups()? What's the reason
to remove this from charge_slab_page()?

> +                           memcg_alloc_page_obj_cgroups(page, s, flags)) {
> +                               obj_cgroup_uncharge(objcg, obj_full_size(s));
> +                               continue;
> +                       }
> +
>                         off = obj_to_index(s, page, p[i]);
>                         obj_cgroup_get(objcg);
>                         page_obj_cgroups(page)[off] = objcg;


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 19:21   ` Shakeel Butt
@ 2020-06-22 20:37     ` Roman Gushchin
  2020-06-22 21:04       ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 20:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Instead of having two sets of kmem_caches: one for system-wide and
> > non-accounted allocations and the second one shared by all accounted
> > allocations, we can use just one.
> >
> > The idea is simple: space for obj_cgroup metadata can be allocated
> > on demand and filled only for accounted allocations.
> >
> > It allows to remove a bunch of code which is required to handle
> > kmem_cache clones for accounted allocations. There is no more need
> > to create them, accumulate statistics, propagate attributes, etc.
> > It's a quite significant simplification.
> >
> > Also, because the total number of slab_caches is reduced almost twice
> > (not all kmem_caches have a memcg clone), some additional memory
> > savings are expected. On my devvm it additionally saves about 3.5%
> > of slab memory.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> [snip]
> >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> >                                               struct obj_cgroup *objcg,
> > -                                             size_t size, void **p)
> > +                                             gfp_t flags, size_t size,
> > +                                             void **p)
> >  {
> >         struct page *page;
> >         unsigned long off;
> >         size_t i;
> >
> > +       if (!objcg)
> > +               return;
> > +
> > +       flags &= ~__GFP_ACCOUNT;
> >         for (i = 0; i < size; i++) {
> >                 if (likely(p[i])) {
> >                         page = virt_to_head_page(p[i]);
> > +
> > +                       if (!page_has_obj_cgroups(page) &&
> 
> The page is already linked into the kmem_cache, don't you need
> synchronization for memcg_alloc_page_obj_cgroups().

Hm, yes, in theory we need it. I guess the reason I've never seen any issues
here is the SLUB percpu partial list.

So in theory we need something like:

diff --git a/mm/slab.h b/mm/slab.h
index 0a31600a0f5c..44bf57815816 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
        if (!vec)
                return -ENOMEM;
 
-       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+       if (cmpxchg(&page->obj_cgroups, 0,
+                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
+               kfree(vec);
+
        return 0;
 }


But I wonder if we might put it under #ifdef CONFIG_SLAB?
Or are there any other ideas on how to make it less expensive?
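
A userspace sketch of the publish-once pattern in the diff above, written
with C11 atomics rather than the kernel's cmpxchg(). struct fake_page and
the fake_* helpers are hypothetical stand-ins, not kernel code, and
calloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page: only the field under discussion. */
struct fake_page {
	_Atomic uintptr_t obj_cgroups;	/* 0, or vector address with bit 0 set */
};

/* Plays the role of page_has_obj_cgroups(): bit 0 tags an installed vector. */
static int fake_page_has_vec(struct fake_page *page)
{
	return (atomic_load(&page->obj_cgroups) & 0x1UL) != 0;
}

/* Strip the tag bit to recover the vector address. */
static void **fake_page_vec(struct fake_page *page)
{
	return (void **)(atomic_load(&page->obj_cgroups) & ~(uintptr_t)0x1);
}

/*
 * Allocate a vector and try to publish it. If another caller published
 * first, the compare-and-swap fails and the local vector is freed. Either
 * way the page ends up with exactly one vector, so 0 is returned.
 */
static int fake_alloc_page_vec(struct fake_page *page, size_t nr_objs)
{
	uintptr_t expected = 0;
	void **vec = calloc(nr_objs, sizeof(*vec));

	if (!vec)
		return -1;	/* the kernel code returns -ENOMEM here */

	/* calloc() memory is at least pointer-aligned, so bit 0 is free. */
	assert(((uintptr_t)vec & 0x1UL) == 0);

	if (!atomic_compare_exchange_strong(&page->obj_cgroups, &expected,
					    (uintptr_t)vec | 0x1UL))
		free(vec);	/* lost the race: keep the winner's vector */

	return 0;
}
```

Calling fake_alloc_page_vec() twice on the same page leaves the first
vector in place; the second call only pays for an allocation and a free.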

> What's the reason to remove this from charge_slab_page()?

Because at charge_slab_page() we don't know if we'll ever need
page->obj_cgroups. Some caches might have only a few or even zero
accounted objects.

> 
> > +                           memcg_alloc_page_obj_cgroups(page, s, flags)) {
> > +                               obj_cgroup_uncharge(objcg, obj_full_size(s));
> > +                               continue;
> > +                       }
> > +
> >                         off = obj_to_index(s, page, p[i]);
> >                         obj_cgroup_get(objcg);
> >                         page_obj_cgroups(page)[off] = objcg;


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 20:37     ` Roman Gushchin
@ 2020-06-22 21:04       ` Shakeel Butt
  2020-06-22 21:13         ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 21:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Instead of having two sets of kmem_caches: one for system-wide and
> > > non-accounted allocations and the second one shared by all accounted
> > > allocations, we can use just one.
> > >
> > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > on demand and filled only for accounted allocations.
> > >
> > > It allows to remove a bunch of code which is required to handle
> > > kmem_cache clones for accounted allocations. There is no more need
> > > to create them, accumulate statistics, propagate attributes, etc.
> > > It's a quite significant simplification.
> > >
> > > Also, because the total number of slab_caches is reduced almost twice
> > > (not all kmem_caches have a memcg clone), some additional memory
> > > savings are expected. On my devvm it additionally saves about 3.5%
> > > of slab memory.
> > >
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > [snip]
> > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >                                               struct obj_cgroup *objcg,
> > > -                                             size_t size, void **p)
> > > +                                             gfp_t flags, size_t size,
> > > +                                             void **p)
> > >  {
> > >         struct page *page;
> > >         unsigned long off;
> > >         size_t i;
> > >
> > > +       if (!objcg)
> > > +               return;
> > > +
> > > +       flags &= ~__GFP_ACCOUNT;
> > >         for (i = 0; i < size; i++) {
> > >                 if (likely(p[i])) {
> > >                         page = virt_to_head_page(p[i]);
> > > +
> > > +                       if (!page_has_obj_cgroups(page) &&
> >
> > The page is already linked into the kmem_cache, don't you need
> > synchronization for memcg_alloc_page_obj_cgroups().
>
> Hm, yes, in theory we need it. I guess the reason behind why I've never seen any issues
> here is the SLUB percpu partial list.
>
> So in theory we need something like:
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 0a31600a0f5c..44bf57815816 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
>         if (!vec)
>                 return -ENOMEM;
>
> -       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> +       if (cmpxchg(&page->obj_cgroups, 0,
> +                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> +               kfree(vec);
> +
>         return 0;
>  }
>
>
> But I wonder if we might put it under #ifdef CONFIG_SLAB?
> Or any other ideas how to make it less expensive?
>
> > What's the reason to remove this from charge_slab_page()?
>
> Because at charge_slab_page() we don't know if we'll ever need
> page->obj_cgroups. Some caches might have only few or even zero
> accounted objects.
>

If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
need page->obj_cgroups. charge_slab_page() happens between
slab_pre_alloc_hook() and slab_post_alloc_hook(), so we should be able
to tell whether page->obj_cgroups is needed.


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 21:04       ` Shakeel Butt
@ 2020-06-22 21:13         ` Roman Gushchin
  2020-06-22 21:28           ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 21:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > non-accounted allocations and the second one shared by all accounted
> > > > allocations, we can use just one.
> > > >
> > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > on demand and filled only for accounted allocations.
> > > >
> > > > It allows to remove a bunch of code which is required to handle
> > > > kmem_cache clones for accounted allocations. There is no more need
> > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > It's a quite significant simplification.
> > > >
> > > > Also, because the total number of slab_caches is reduced almost twice
> > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > of slab memory.
> > > >
> > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > ---
> > > [snip]
> > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > > >                                               struct obj_cgroup *objcg,
> > > > -                                             size_t size, void **p)
> > > > +                                             gfp_t flags, size_t size,
> > > > +                                             void **p)
> > > >  {
> > > >         struct page *page;
> > > >         unsigned long off;
> > > >         size_t i;
> > > >
> > > > +       if (!objcg)
> > > > +               return;
> > > > +
> > > > +       flags &= ~__GFP_ACCOUNT;
> > > >         for (i = 0; i < size; i++) {
> > > >                 if (likely(p[i])) {
> > > >                         page = virt_to_head_page(p[i]);
> > > > +
> > > > +                       if (!page_has_obj_cgroups(page) &&
> > >
> > > The page is already linked into the kmem_cache, don't you need
> > > synchronization for memcg_alloc_page_obj_cgroups().
> >
> > Hm, yes, in theory we need it. I guess the reason behind why I've never seen any issues
> > here is the SLUB percpu partial list.
> >
> > So in theory we need something like:
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 0a31600a0f5c..44bf57815816 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> >         if (!vec)
> >                 return -ENOMEM;
> >
> > -       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> > +       if (cmpxchg(&page->obj_cgroups, 0,
> > +                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> > +               kfree(vec);
> > +
> >         return 0;
> >  }
> >
> >
> > But I wonder if we might put it under #ifdef CONFIG_SLAB?
> > Or any other ideas how to make it less expensive?
> >
> > > What's the reason to remove this from charge_slab_page()?
> >
> > Because at charge_slab_page() we don't know if we'll ever need
> > page->obj_cgroups. Some caches might have only few or even zero
> > accounted objects.
> >
> 
> If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
> need page->obj_cgroups.  The charge_slab_page() happens between
> slab_pre_alloc_hook() & slab_post_alloc_hook(), so, we should be able
> to tell if page->obj_cgroups is needed.

Yes, but the opposite is not always true: we can reuse an existing page
whose page->obj_cgroups has not been allocated. In this case
charge_slab_page() is not involved at all.
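
The reuse case can be modelled in plain C. This is an illustrative sketch
only: struct model_page, model_new_page() and model_alloc_obj() are made-up
names, and the model ignores concurrency and freeing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJS_PER_PAGE 4

/* Hypothetical model of a slab page shared by accounted and unaccounted objects. */
struct model_page {
	const char **owners;	/* per-object owner vector, NULL until needed */
	int in_use;		/* number of objects handed out so far */
};

/* Models the new-page path (where charge_slab_page() runs): no vector yet. */
static void model_new_page(struct model_page *page)
{
	page->owners = NULL;
	page->in_use = 0;
}

/*
 * Models the post-allocation hook. owner == NULL stands for an unaccounted
 * allocation; otherwise the vector is allocated on first use.
 * Returns the object index, or -1 if the page is full or out of memory.
 */
static int model_alloc_obj(struct model_page *page, const char *owner)
{
	int idx;

	assert(page->in_use >= 0 && page->in_use <= OBJS_PER_PAGE);
	if (page->in_use == OBJS_PER_PAGE)
		return -1;

	if (owner && !page->owners) {
		page->owners = calloc(OBJS_PER_PAGE, sizeof(*page->owners));
		if (!page->owners)
			return -1;
	}

	idx = page->in_use++;
	if (owner)
		page->owners[idx] = owner;
	return idx;
}
```

An unaccounted allocation creates the page with no vector; a later
accounted allocation on the same page is the first point at which the
vector is needed, and the new-page path has long since finished by then.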

Or do you mean that we can minimize the amount of required synchronization
by allocating some obj_cgroups vectors from charge_slab_page()?


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 21:13         ` Roman Gushchin
@ 2020-06-22 21:28           ` Shakeel Butt
  2020-06-22 21:58             ` Roman Gushchin
  0 siblings, 1 reply; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 21:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 2:15 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > > non-accounted allocations and the second one shared by all accounted
> > > > > allocations, we can use just one.
> > > > >
> > > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > > on demand and filled only for accounted allocations.
> > > > >
> > > > > It allows to remove a bunch of code which is required to handle
> > > > > kmem_cache clones for accounted allocations. There is no more need
> > > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > > It's a quite significant simplification.
> > > > >
> > > > > Also, because the total number of slab_caches is reduced almost twice
> > > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > > of slab memory.
> > > > >
> > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > ---
> > > > [snip]
> > > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > > > >                                               struct obj_cgroup *objcg,
> > > > > -                                             size_t size, void **p)
> > > > > +                                             gfp_t flags, size_t size,
> > > > > +                                             void **p)
> > > > >  {
> > > > >         struct page *page;
> > > > >         unsigned long off;
> > > > >         size_t i;
> > > > >
> > > > > +       if (!objcg)
> > > > > +               return;
> > > > > +
> > > > > +       flags &= ~__GFP_ACCOUNT;
> > > > >         for (i = 0; i < size; i++) {
> > > > >                 if (likely(p[i])) {
> > > > >                         page = virt_to_head_page(p[i]);
> > > > > +
> > > > > +                       if (!page_has_obj_cgroups(page) &&
> > > >
> > > > The page is already linked into the kmem_cache, don't you need
> > > > synchronization for memcg_alloc_page_obj_cgroups().
> > >
> > > Hm, yes, in theory we need it. I guess the reason behind why I've never seen any issues
> > > here is the SLUB percpu partial list.
> > >
> > > So in theory we need something like:
> > >
> > > diff --git a/mm/slab.h b/mm/slab.h
> > > index 0a31600a0f5c..44bf57815816 100644
> > > --- a/mm/slab.h
> > > +++ b/mm/slab.h
> > > @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> > >         if (!vec)
> > >                 return -ENOMEM;
> > >
> > > -       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> > > +       if (cmpxchg(&page->obj_cgroups, 0,
> > > +                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> > > +               kfree(vec);
> > > +
> > >         return 0;
> > >  }
> > >
> > >
> > > But I wonder if we might put it under #ifdef CONFIG_SLAB?
> > > Or any other ideas how to make it less expensive?
> > >
> > > > What's the reason to remove this from charge_slab_page()?
> > >
> > > Because at charge_slab_page() we don't know if we'll ever need
> > > page->obj_cgroups. Some caches might have only few or even zero
> > > accounted objects.
> > >
> >
> > If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
> > need page->obj_cgroups.  The charge_slab_page() happens between
> > slab_pre_alloc_hook() & slab_post_alloc_hook(), so, we should be able
> > to tell if page->obj_cgroups is needed.
>
> Yes, but the opposite is not always true: we can reuse the existing page
> without allocated page->obj_cgroups. In this case charge_slab_page() is
> not involved at all.
>

Hmm yeah, you are right. I missed that.

>
> Or do you mean that we can minimize the amount of required synchronization
> by allocating some obj_cgroups vectors from charge_slab_page()?

One optimization would be to always pre-allocate page->obj_cgroups for
kmem_caches with SLAB_ACCOUNT.


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 21:28           ` Shakeel Butt
@ 2020-06-22 21:58             ` Roman Gushchin
  2020-06-22 22:05               ` Shakeel Butt
  0 siblings, 1 reply; 92+ messages in thread
From: Roman Gushchin @ 2020-06-22 21:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 02:28:54PM -0700, Shakeel Butt wrote:
> On Mon, Jun 22, 2020 at 2:15 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> > > On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > > > >
> > > > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > > > non-accounted allocations and the second one shared by all accounted
> > > > > > allocations, we can use just one.
> > > > > >
> > > > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > > > on demand and filled only for accounted allocations.
> > > > > >
> > > > > > It allows to remove a bunch of code which is required to handle
> > > > > > kmem_cache clones for accounted allocations. There is no more need
> > > > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > > > It's a quite significant simplification.
> > > > > >
> > > > > > Also, because the total number of slab_caches is reduced almost twice
> > > > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > > > of slab memory.
> > > > > >
> > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > > ---
> > > > > [snip]
> > > > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > > > > >                                               struct obj_cgroup *objcg,
> > > > > > -                                             size_t size, void **p)
> > > > > > +                                             gfp_t flags, size_t size,
> > > > > > +                                             void **p)
> > > > > >  {
> > > > > >         struct page *page;
> > > > > >         unsigned long off;
> > > > > >         size_t i;
> > > > > >
> > > > > > +       if (!objcg)
> > > > > > +               return;
> > > > > > +
> > > > > > +       flags &= ~__GFP_ACCOUNT;
> > > > > >         for (i = 0; i < size; i++) {
> > > > > >                 if (likely(p[i])) {
> > > > > >                         page = virt_to_head_page(p[i]);
> > > > > > +
> > > > > > +                       if (!page_has_obj_cgroups(page) &&
> > > > >
> > > > > The page is already linked into the kmem_cache, don't you need
> > > > > synchronization for memcg_alloc_page_obj_cgroups().
> > > >
> > > > Hm, yes, in theory we need it. I guess the reason behind why I've never seen any issues
> > > > here is the SLUB percpu partial list.
> > > >
> > > > So in theory we need something like:
> > > >
> > > > diff --git a/mm/slab.h b/mm/slab.h
> > > > index 0a31600a0f5c..44bf57815816 100644
> > > > --- a/mm/slab.h
> > > > +++ b/mm/slab.h
> > > > @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> > > >         if (!vec)
> > > >                 return -ENOMEM;
> > > >
> > > > -       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> > > > +       if (cmpxchg(&page->obj_cgroups, 0,
> > > > +                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> > > > +               kfree(vec);
> > > > +
> > > >         return 0;
> > > >  }
> > > >
> > > >
> > > > But I wonder if we might put it under #ifdef CONFIG_SLAB?
> > > > Or any other ideas how to make it less expensive?
> > > >
> > > > > What's the reason to remove this from charge_slab_page()?
> > > >
> > > > Because at charge_slab_page() we don't know if we'll ever need
> > > > page->obj_cgroups. Some caches might have only few or even zero
> > > > accounted objects.
> > > >
> > >
> > > If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
> > > need page->obj_cgroups.  The charge_slab_page() happens between
> > > slab_pre_alloc_hook() & slab_post_alloc_hook(), so, we should be able
> > > to tell if page->obj_cgroups is needed.
> >
> > Yes, but the opposite is not always true: we can reuse the existing page
> > without allocated page->obj_cgroups. In this case charge_slab_page() is
> > not involved at all.
> >
> 
> Hmm yeah, you are right. I missed that.
> 
> >
> > Or do you mean that we can minimize the amount of required synchronization
> > by allocating some obj_cgroups vectors from charge_slab_page()?
> 
> One optimization would be to always pre-allocate page->obj_cgroups for
> kmem_caches with SLAB_ACCOUNT.

Even this is not completely free of memory overhead, because processes belonging
to the root cgroup and kthreads might allocate from such a cache.
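
To put a rough number on that overhead, here is a small sketch that assumes
the vector holds one pointer per object in the slab page. vec_bytes() and
vec_overhead_percent() are hypothetical helpers, and the page and object
sizes are illustrative inputs rather than values taken from the patchset.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes needed for a per-page vector holding one pointer per object.
 * page_bytes and obj_bytes are illustrative inputs, not kernel values.
 */
static size_t vec_bytes(size_t page_bytes, size_t obj_bytes)
{
	assert(obj_bytes > 0);
	return (page_bytes / obj_bytes) * sizeof(void *);
}

/* Vector size as a percentage of the slab page, rounded down. */
static size_t vec_overhead_percent(size_t page_bytes, size_t obj_bytes)
{
	assert(page_bytes > 0);
	return vec_bytes(page_bytes, obj_bytes) * 100 / page_bytes;
}
```

With 8-byte pointers, a 4096-byte page of 64-byte objects needs a 512-byte
vector, which is 12.5% of the page. Pre-allocating it pays that cost even
if every object on the page turns out to be unaccounted.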

Anyway, I think I'll go with cmpxchg() for now and will think about possible
optimizations later. Because the allocation happens only once per lifetime
of a slab page, and is very unlikely to race with a concurrent one on the same
page, the penalty shouldn't be that big.

Thanks!


* Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations
  2020-06-22 21:58             ` Roman Gushchin
@ 2020-06-22 22:05               ` Shakeel Butt
  0 siblings, 0 replies; 92+ messages in thread
From: Shakeel Butt @ 2020-06-22 22:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Christoph Lameter, Johannes Weiner, Michal Hocko,
	Linux MM, Vlastimil Babka, Kernel Team, LKML

On Mon, Jun 22, 2020 at 2:58 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jun 22, 2020 at 02:28:54PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 2:15 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> > > > > > >
> > > > > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > > > > non-accounted allocations and the second one shared by all accounted
> > > > > > > allocations, we can use just one.
> > > > > > >
> > > > > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > > > > on demand and filled only for accounted allocations.
> > > > > > >
> > > > > > > It allows to remove a bunch of code which is required to handle
> > > > > > > kmem_cache clones for accounted allocations. There is no more need
> > > > > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > > > > It's a quite significant simplification.
> > > > > > >
> > > > > > > Also, because the total number of slab_caches is reduced almost twice
> > > > > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > > > > of slab memory.
> > > > > > >
> > > > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > ---
> > > > > > [snip]
> > > > > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > > > > > >                                               struct obj_cgroup *objcg,
> > > > > > > -                                             size_t size, void **p)
> > > > > > > +                                             gfp_t flags, size_t size,
> > > > > > > +                                             void **p)
> > > > > > >  {
> > > > > > >         struct page *page;
> > > > > > >         unsigned long off;
> > > > > > >         size_t i;
> > > > > > >
> > > > > > > +       if (!objcg)
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       flags &= ~__GFP_ACCOUNT;
> > > > > > >         for (i = 0; i < size; i++) {
> > > > > > >                 if (likely(p[i])) {
> > > > > > >                         page = virt_to_head_page(p[i]);
> > > > > > > +
> > > > > > > +                       if (!page_has_obj_cgroups(page) &&
> > > > > >
> > > > > > The page is already linked into the kmem_cache, don't you need
> > > > > > synchronization for memcg_alloc_page_obj_cgroups().
> > > > >
> > > > > Hm, yes, in theory we need it. I guess the reason behind why I've never seen any issues
> > > > > here is the SLUB percpu partial list.
> > > > >
> > > > > So in theory we need something like:
> > > > >
> > > > > diff --git a/mm/slab.h b/mm/slab.h
> > > > > index 0a31600a0f5c..44bf57815816 100644
> > > > > --- a/mm/slab.h
> > > > > +++ b/mm/slab.h
> > > > > @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> > > > >         if (!vec)
> > > > >                 return -ENOMEM;
> > > > >
> > > > > -       page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
> > > > > +       if (cmpxchg(&page->obj_cgroups, 0,
> > > > > +                   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> > > > > +               kfree(vec);
> > > > > +
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > >
> > > > > But I wonder if we might put it under #ifdef CONFIG_SLAB?
> > > > > Or any other ideas how to make it less expensive?
> > > > >
> > > > > > What's the reason to remove this from charge_slab_page()?
> > > > >
> > > > > Because at charge_slab_page() we don't know if we'll ever need
> > > > > page->obj_cgroups. Some caches might have only few or even zero
> > > > > accounted objects.
> > > > >
> > > >
> > > > If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
> > > > need page->obj_cgroups.  The charge_slab_page() happens between
> > > > slab_pre_alloc_hook() & slab_post_alloc_hook(), so, we should be able
> > > > to tell if page->obj_cgroups is needed.
> > >
> > > Yes, but the opposite is not always true: we can reuse the existing page
> > > without allocated page->obj_cgroups. In this case charge_slab_page() is
> > > not involved at all.
> > >
> >
> > Hmm yeah, you are right. I missed that.
> >
> > >
> > > Or do you mean that we can minimize the amount of required synchronization
> > > by allocating some obj_cgroups vectors from charge_slab_page()?
> >
> > One optimization would be to always pre-allocate page->obj_cgroups for
> > kmem_caches with SLAB_ACCOUNT.
>
> Even this is not completely memory overhead-free, because processes belonging
> to the root cgroup and kthreads might allocate from such cache.
>

Yes, it is not completely free of memory overhead, but please note that in
the containerized world running in the root container is discouraged, and
for SLAB_ACCOUNT kmem_caches, allocations from root-container processes
and kthreads should be very rare.

>
> Anyway, I think I'll go with cmpxchg() for now and will think about possible
> optimizations later.

I agree that we can think about optimizations later (particularly such
heuristics-based optimizations).


Thread overview: 92+ messages
2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
2020-06-17  1:52   ` Shakeel Butt
2020-06-17  2:50     ` Roman Gushchin
2020-06-17  2:59       ` Shakeel Butt
2020-06-17  3:19         ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
2020-06-17  2:57   ` Shakeel Butt
2020-06-17  3:19     ` Roman Gushchin
2020-06-17 15:55   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
2020-06-17  3:03   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2020-06-17  3:08   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
2020-06-18  0:47   ` Shakeel Butt
2020-06-18 14:55   ` Shakeel Butt
2020-06-18 19:51     ` Roman Gushchin
2020-06-19  1:08     ` Roman Gushchin
2020-06-19  1:18       ` Shakeel Butt
2020-06-19  1:31   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
2020-06-19 15:42   ` Shakeel Butt
2020-06-19 21:38     ` Roman Gushchin
2020-06-19 22:16       ` Shakeel Butt
2020-06-19 22:52         ` Roman Gushchin
2020-06-20 22:50       ` Andrew Morton
2020-06-08 23:06 ` [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
2020-06-19 16:36   ` Shakeel Butt
2020-06-20  0:25     ` Roman Gushchin
2020-06-20  0:31       ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
2020-06-20  0:16   ` Shakeel Butt
2020-06-20  1:19     ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
2020-06-20  0:54   ` Shakeel Butt
2020-06-20  1:29     ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
2020-06-22 17:12   ` Shakeel Butt
2020-06-22 18:01     ` Roman Gushchin
2020-06-22 18:09       ` Shakeel Butt
2020-06-22 18:25         ` Roman Gushchin
2020-06-22 18:38           ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2020-06-20  1:19   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
2020-06-22 16:56   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
2020-06-22 17:29   ` Shakeel Butt
2020-06-22 17:40     ` Roman Gushchin
2020-06-22 18:03       ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache() Roman Gushchin
2020-06-22 18:42   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
2020-06-22 17:36   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
2020-06-22 17:32   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
2020-06-17 23:35   ` Andrew Morton
2020-06-18  0:35     ` Roman Gushchin
2020-06-18  7:33       ` Vlastimil Babka
2020-06-18 19:54         ` Roman Gushchin
2020-06-22 19:21   ` Shakeel Butt
2020-06-22 20:37     ` Roman Gushchin
2020-06-22 21:04       ` Shakeel Butt
2020-06-22 21:13         ` Roman Gushchin
2020-06-22 21:28           ` Shakeel Butt
2020-06-22 21:58             ` Roman Gushchin
2020-06-22 22:05               ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
2020-06-17  2:41   ` Roman Gushchin
2020-06-17  3:05     ` Shakeel Butt
2020-06-17  3:32       ` Roman Gushchin
2020-06-17 11:24         ` Vlastimil Babka
2020-06-17 14:31           ` Mel Gorman
2020-06-20  0:57             ` Roman Gushchin
2020-06-18  1:29           ` Roman Gushchin
2020-06-18  8:43             ` Jesper Dangaard Brouer
2020-06-18  9:31               ` Jesper Dangaard Brouer
2020-06-19  1:30                 ` Roman Gushchin
2020-06-19  8:32                   ` Jesper Dangaard Brouer
2020-06-19  1:27               ` Roman Gushchin
2020-06-19  9:39                 ` Jesper Dangaard Brouer
2020-06-19 18:47                   ` Roman Gushchin
2020-06-18  1:18   ` Roman Gushchin
2020-06-18  9:27 ` Mike Rapoport
2020-06-18 20:43   ` Roman Gushchin
2020-06-21 22:57 ` Qian Cai
2020-06-21 23:34   ` Roman Gushchin
2020-06-21 23:53     ` Qian Cai
2020-06-22  3:07       ` Roman Gushchin
