linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/16] The new slab memory controller
@ 2019-10-18  0:28 Roman Gushchin
  2019-10-18  0:28 ` [PATCH 01/16] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
                   ` (17 more replies)
  0 siblings, 18 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

The existing slab memory controller is based on the idea of replicating
slab allocator internals for each memory cgroup. This approach promises
a low memory overhead (one pointer per page) and doesn't add much code
to the hot allocation and release paths. But it has a very serious flaw:
it leads to low slab utilization.

Using a drgn* script I've estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings utilization into the high 90s and
gives back 30-50% of slab memory. In other words, the real price of the
existing slab memory controller is much higher than a pointer per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If a cgroup makes only a few allocations of a certain size, if some
active objects (e.g. dentries) are left behind after the cgroup is
deleted, or if the cgroup contains a single-threaded application which
barely allocates any kernel objects but does so on a new CPU each time,
the resulting slab utilization is very low. If kmem accounting is off,
the kernel is able to use the free space on slab pages for other
allocations.

Arguably it wasn't an issue back in the days when the kmem controller
was introduced and was an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory
controller, which aims to reach much better slab utilization by sharing
slab pages between multiple memory cgroups. Below is a short description
of the new design (more details are in the commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is still performed in
whole pages, rounding up and remembering the leftover bytes.
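
To illustrate the charging scheme, here is a deliberately simplified
sketch. The structure, field and helper names below are made up for
illustration (charge_pages() stands in for the real page-level charge
path), and the per-cpu caching and synchronization done by the actual
code are omitted:

/* hypothetical per-memcg state, for illustration only */
struct kmem_charge_state {
        struct mem_cgroup *memcg;
        size_t cached_bytes;    /* unused part of the last charged page */
};

/*
 * Charge @size bytes: the page counter is charged in whole pages, and
 * the unused remainder of the last charged page is remembered, so that
 * subsequent sub-page charges can consume it without touching the page
 * counter again.
 */
static int charge_slab_obj(struct kmem_charge_state *st, size_t size,
                           gfp_t gfp)
{
        unsigned int nr_pages;

        if (size <= st->cached_bytes) {
                st->cached_bytes -= size;
                return 0;
        }

        size -= st->cached_bytes;
        nr_pages = DIV_ROUND_UP(size, PAGE_SIZE);
        if (charge_pages(st->memcg, gfp, nr_pages))
                return -ENOMEM;

        st->cached_bytes = nr_pages * PAGE_SIZE - size;
        return 0;
}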

Memcg ownership data is stored in a per-slab-page vector: for each slab
page a vector of the corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of a direct
pointer to the memory cgroup. It's simply a pointer to a memcg (which can
easily be switched to the parent) with a built-in reference counter. This
scheme allows reparenting all allocated objects without walking over them
and changing the memcg pointer in each one.
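
Schematically, the ownership scheme looks as follows. This is a
simplified sketch: the real structures, including the percpu reference
counter, RCU handling and locking, are introduced in patches 1, 4 and 6,
and the vector pointer actually lives in struct page:

struct mem_cgroup;

/* one shared, refcounted handle per memory cgroup (see patch 1) */
struct mem_cgroup_ptr {
        struct mem_cgroup *memcg; /* switched to the parent on reparenting */
        /* the real structure also embeds a percpu_ref */
};

/* each slab page carries one slot per object (see patch 4) */
struct slab_ownership {
        struct mem_cgroup_ptr **vec; /* vec[obj_index] -> owner's handle */
};

/*
 * Reparenting boils down to a single store per handle, instead of
 * walking every accounted object.
 */
static void reparent(struct mem_cgroup_ptr *ptr, struct mem_cgroup *parent)
{
        ptr->memcg = parent;
}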

Instead of creating an individual set of kmem_caches for each memory
cgroup, two global sets are used: the root set for non-accounted and
root-cgroup allocations, and a second set for all other allocations.
This simplifies the lifetime management of individual kmem_caches: they
are destroyed together with their root counterparts. It removes a good
amount of code and makes things generally simpler.

The patchset contains a couple of semi-independent parts, which can also
find use outside of the slab memory controller:
1) a subpage charging API, which can be used in the future for accounting
   other non-page-sized objects, e.g. percpu allocations;
2) the mem_cgroup_ptr API (refcounted pointers to a memcg), which can be
   reused for the efficient reparenting of other objects, e.g. pagecache.

The patchset has been tested on a number of different workloads in our
production. In all cases it saved hefty amounts of memory:
1) web frontend, 650-700 MB, ~42% of slab memory
2) database cache, 750-800 MB, ~35% of slab memory
3) dns server, 700 MB, ~36% of slab memory

(These numbers were obtained using a backport of this patchset to the
kernel version used in fb production, but similar numbers can be obtained
on a vanilla kernel. When used on a modern systemd-based distribution,
e.g. Fedora 30, the patched kernel shows the same order of slab memory
savings right after system start.)

So far I haven't found any regression in any of the tested workloads,
but a potential CPU regression caused by the more precise accounting
remains a concern.

Obviously the amount of saved memory depends on the number of memory
cgroups, the uptime and the specific workloads, but overall it feels like
the new controller saves 30-40% of slab memory, sometimes more.
Additionally, it should lead to lower memory fragmentation, simply
because there are fewer non-movable pages, and also because there is no
longer a need to move all slab objects to a new set of pages when a
workload is restarted in a new memory cgroup.

* https://github.com/osandov/drgn


v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Roman Gushchin (16):
  mm: memcg: introduce mem_cgroup_ptr
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: allocate space for memcg ownership data for non-root
    slabs
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: save memcg ownership data for non-root slab objects
  mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg: introduce __mod_lruvec_memcg_state()
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
  mm: memcg/slab: replace memcg_from_slab_page() with
    memcg_from_slab_obj()
  tools/cgroup: add slabinfo.py tool
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: use one set of kmem_caches for all memory cgroups
  tools/cgroup: make slabinfo.py compatible with new slab controller
  mm: slab: remove redundant check in memcg_accumulate_slabinfo()

 drivers/base/node.c        |  14 +-
 fs/proc/meminfo.c          |   4 +-
 include/linux/memcontrol.h |  98 +++++++-
 include/linux/mm_types.h   |   5 +-
 include/linux/mmzone.h     |  12 +-
 include/linux/slab.h       |   3 +-
 include/linux/slub_def.h   |   9 +
 include/linux/vmstat.h     |   8 +
 kernel/power/snapshot.c    |   2 +-
 mm/list_lru.c              |  12 +-
 mm/memcontrol.c            | 302 ++++++++++++-------------
 mm/oom_kill.c              |   2 +-
 mm/page_alloc.c            |   8 +-
 mm/slab.c                  |  37 ++-
 mm/slab.h                  | 300 ++++++++++++------------
 mm/slab_common.c           | 452 ++++---------------------------------
 mm/slob.c                  |  12 +-
 mm/slub.c                  |  63 ++----
 mm/vmscan.c                |   3 +-
 mm/vmstat.c                |  37 ++-
 mm/workingset.c            |   6 +-
 tools/cgroup/slabinfo.py   | 250 ++++++++++++++++++++
 22 files changed, 816 insertions(+), 823 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py

-- 
2.21.0



* [PATCH 01/16] mm: memcg: introduce mem_cgroup_ptr
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

This commit introduces the mem_cgroup_ptr structure and the
corresponding API. It implements a pointer to a memory cgroup with a
built-in reference counter. Its main goal is to make reparenting
efficient.

If a number of objects (e.g. slab pages) have to keep a pointer and
a reference to a memory cgroup, they can use mem_cgroup_ptr instead.
On reparenting, only one mem_cgroup_ptr->memcg pointer has to be
changed, instead of walking over all accounted objects.

mem_cgroup_ptr holds a single reference to the corresponding memory
cgroup. Because it's initialized before the css reference counter, the
css refcounter can't be bumped at allocation time. Instead, it's bumped
on reparenting, which happens during offlining. A cgroup is never
released while online, so this is fine.

mem_cgroup_ptr is released using RCU, so memcg->kmem_memcg_ptr can be
accessed in an RCU read-side critical section. On reparenting it's
atomically switched to NULL. If a reader gets NULL, it can just read the
parent's kmem_memcg_ptr instead.

Each memory cgroup contains a list of kmem_memcg_ptrs. On reparenting
the list is spliced into the parent's list. The list is protected
using the css set lock.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 50 ++++++++++++++++++++++
 mm/memcontrol.c            | 87 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 99033f9bd319..da864fded297 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct mem_cgroup_ptr;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -198,6 +199,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * A pointer to a memory cgroup with a built-in reference counter.
+ * For a use as an intermediate object to simplify reparenting of
+ * objects charged to the cgroup. The memcg pointer can be switched
+ * to the parent cgroup without a need to modify all objects
+ * which hold the reference to the cgroup.
+ */
+struct mem_cgroup_ptr {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -311,6 +328,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
+	struct list_head kmem_memcg_ptr_list;
 #endif
 
 	int last_scanned_node;
@@ -439,6 +458,21 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
+{
+	return percpu_ref_tryget(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_get(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_put(&ptr->refcnt);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1435,6 +1469,22 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+static inline struct mem_cgroup_ptr *
+mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+
+	rcu_read_lock();
+	do {
+		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
+		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
+			break;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+
+	return memcg_ptr;
+}
+
 #else
 
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3bdb8d4c355..cd8ac747827e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -258,6 +258,77 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void memcg_ptr_release(struct percpu_ref *ref)
+{
+	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
+						  refcnt);
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	list_del(&ptr->list);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	mem_cgroup_put(ptr->memcg);
+	percpu_ref_exit(ref);
+	kfree_rcu(ptr, rcu);
+}
+
+static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *kmem_memcg_ptr;
+	int ret;
+
+	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
+	if (!kmem_memcg_ptr)
+		return -ENOMEM;
+
+	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
+			      0, GFP_KERNEL);
+	if (ret) {
+		kfree(kmem_memcg_ptr);
+		return ret;
+	}
+
+	kmem_memcg_ptr->memcg = memcg;
+	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
+	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
+	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
+	return 0;
+}
+
+static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
+					  struct mem_cgroup *parent)
+{
+	unsigned int nr_reparented = 0;
+	struct mem_cgroup_ptr *memcg_ptr = NULL;
+
+	rcu_swap_protected(memcg->kmem_memcg_ptr, memcg_ptr, true);
+	percpu_ref_kill(&memcg_ptr->refcnt);
+
+	/*
+	 * kmem_memcg_ptr is initialized before css refcounter, so until now
+	 * it doesn't hold a reference to the memcg. Bump it here.
+	 */
+	css_get(&memcg->css);
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(memcg_ptr, &memcg->kmem_memcg_ptr_list, list) {
+		xchg(&memcg_ptr->memcg, parent);
+		nr_reparented++;
+	}
+	if (nr_reparented)
+		list_splice(&memcg->kmem_memcg_ptr_list,
+			    &parent->kmem_memcg_ptr_list);
+	spin_unlock_irq(&css_set_lock);
+
+	if (nr_reparented) {
+		css_get_many(&parent->css, nr_reparented);
+		css_put_many(&memcg->css, nr_reparented);
+	}
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -3572,7 +3643,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	int memcg_id;
+	int memcg_id, ret;
 
 	if (cgroup_memory_nokmem)
 		return 0;
@@ -3584,6 +3655,12 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	ret = memcg_init_kmem_memcg_ptr(memcg);
+	if (ret) {
+		memcg_free_cache_id(memcg_id);
+		return ret;
+	}
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3619,12 +3696,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches. Then flush percpu
-	 * slab statistics to have precise values at the parent and
-	 * all ancestor levels. It's required to keep slab stats
+	 * Deactivate and reparent kmem_caches and reparent kmem_memcg_ptr.
+	 * Then flush percpu slab statistics to have precise values at the
+	 * parent and all ancestor levels. It's required to keep slab stats
 	 * accurate after the reparenting of kmem_caches.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_kmem_memcg_ptr(memcg, parent);
 	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -5180,6 +5258,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->kmem_memcg_ptr_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.21.0



* [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
  2019-10-18  0:28 ` [PATCH 01/16] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-20 22:44   ` Christopher Lameter
  2019-10-20 22:51   ` Christopher Lameter
  2019-10-18  0:28 ` [PATCH 03/16] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Currently the s8 type is used for per-cpu caching of per-node
statistics. It works fine because the overfill threshold can't exceed
125.

But if some counters are in bytes (and the next commit in the series
will convert slab counters to bytes), it won't work anymore: a value in
bytes can easily overflow s8 without exceeding the threshold converted
to bytes. For example, with 4 KB pages a threshold of 125 pages
corresponds to 512000 bytes, far beyond the s8 range of [-128, 127].
So to avoid overfilling per-cpu caches and breaking vmstat correctness,
let's use s32 instead.

This doesn't affect per-zone statistics. There are no plans to use
zone-level byte-sized counters, so there is no reason to change anything
there.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4ca03b93373..4d8adaa5d59c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -353,7 +353,7 @@ struct per_cpu_pageset {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2fd344d2fcf..3375e7f45891 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -337,7 +337,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -395,13 +395,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -439,13 +439,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -538,7 +538,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
        enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	do {
-- 
2.21.0



* [PATCH 03/16] mm: vmstat: convert slab vmstat counter to bytes
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
  2019-10-18  0:28 ` [PATCH 01/16] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
  2019-10-18  0:28 ` [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 04/16] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

In order to prepare for per-object slab memory accounting, convert the
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it clear that these vmstats are in bytes, rename them to
NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
NR_KERNEL_STACK_KB).

The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
will fit into the atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/base/node.c     | 14 +++++++++-----
 fs/proc/meminfo.c       |  4 ++--
 include/linux/mmzone.h  | 10 ++++++++--
 include/linux/vmstat.h  |  8 ++++++++
 kernel/power/snapshot.c |  2 +-
 mm/memcontrol.c         | 29 ++++++++++++++++-------------
 mm/oom_kill.c           |  2 +-
 mm/page_alloc.c         |  8 ++++----
 mm/slab.h               | 15 ++++++++-------
 mm/slab_common.c        |  4 ++--
 mm/slob.c               | 12 ++++++------
 mm/slub.c               |  8 ++++----
 mm/vmscan.c             |  3 ++-
 mm/vmstat.c             | 21 +++++++++++++++++++--
 mm/workingset.c         |  6 ++++--
 15 files changed, 94 insertions(+), 52 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 98a31bafc8a2..fa07c5806dcd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
-	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
-	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+	sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
@@ -505,9 +505,13 @@ static ssize_t node_read_vmstat(struct device *dev,
 			     sum_zone_numa_state(nid, i));
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
-			     node_page_state(pgdat, i));
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		unsigned long x = node_page_state(pgdat, i);
+
+		if (vmstat_item_in_bytes(i))
+			x >>= PAGE_SHIFT;
+		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i), x);
+	}
 
 	return n;
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index ac9247371871..87afa8683c1b 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -53,8 +53,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
-	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
-	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4d8adaa5d59c..e20c200bff95 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,8 +215,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,	/* Please do not reorder this item */
-	NR_SLAB_UNRECLAIMABLE,	/* and this one without looking at
+	NR_SLAB_RECLAIMABLE_B,	/* Please do not reorder this item */
+	NR_SLAB_UNRECLAIMABLE_B,/* and this one without looking at
 				 * memcg_flush_percpu_vmstats() first. */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
@@ -247,6 +247,12 @@ enum node_stat_item {
 	NR_VM_NODE_STAT_ITEMS
 };
 
+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+	return (item == NR_SLAB_RECLAIMABLE_B ||
+		item == NR_SLAB_UNRECLAIMABLE_B);
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..5b0f61b3ca75 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -200,6 +200,12 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return x;
 }
 
+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
+{
+	return global_node_page_state(item) >> PAGE_SHIFT;
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -240,6 +246,8 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+					   enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 26b9168321e7..d3911f75c39b 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1666,7 +1666,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
 {
 	unsigned long size;
 
-	size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+	size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
 		+ global_node_page_state(NR_ACTIVE_ANON)
 		+ global_node_page_state(NR_INACTIVE_ANON)
 		+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd8ac747827e..9303e98b0718 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -751,13 +751,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	if (mem_cgroup_disabled())
 		return;
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup *mi;
 
 		/*
@@ -799,7 +802,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	/* Update node */
 	__mod_node_page_state(pgdat, idx, val);
@@ -816,8 +819,11 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup_per_node *pi;
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -1466,9 +1472,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
 		       1024);
 	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
-		       PAGE_SIZE);
+		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
@@ -1502,11 +1507,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 			       PAGE_SIZE);
 
 	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
 
 	/* Accumulated memory events */
 
@@ -3582,8 +3585,8 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
 	int min_idx, max_idx;
 
 	if (slab_only) {
-		min_idx = NR_SLAB_RECLAIMABLE;
-		max_idx = NR_SLAB_UNRECLAIMABLE;
+		min_idx = NR_SLAB_RECLAIMABLE_B;
+		max_idx = NR_SLAB_UNRECLAIMABLE_B;
 	} else {
 		min_idx = 0;
 		max_idx = MEMCG_NR_STAT;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314ce1a3cf25..61476bf0f147 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -183,7 +183,7 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
-	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1dd0712624..c037bdaa73a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5130,8 +5130,8 @@ long si_mem_available(void)
 	 * items that are in use, and cannot be freed. Cap this estimate at the
 	 * low watermark.
 	 */
-	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
-			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
@@ -5275,8 +5275,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
-		global_node_page_state(NR_SLAB_RECLAIMABLE),
-		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index 3eb29ae75743..03833b02b9ae 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 static inline int cache_vmstat_idx(struct kmem_cache *s)
 {
 	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+		NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -360,7 +360,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    (1 << order));
+				    (PAGE_SIZE << order));
 		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 		return 0;
 	}
@@ -370,7 +370,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), PAGE_SIZE << order);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
@@ -394,11 +394,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+				 -(PAGE_SIZE << order));
 		memcg_kmem_uncharge_memcg(page, order, memcg);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 	}
 	rcu_read_unlock();
 
@@ -482,7 +483,7 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    1 << order);
+				    PAGE_SIZE << order);
 		return 0;
 	}
 
@@ -494,7 +495,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 		return;
 	}
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8afa188f6e20..f0f7f955c5fa 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1311,8 +1311,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	page = alloc_pages(flags, order);
 	if (likely(page)) {
 		ret = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 	ret = kasan_kmalloc_large(ret, size, flags);
 	/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index fa53e9f73893..8b7b56235438 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
-	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-			    1 << order);
+	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+			    PAGE_SIZE << order);
 	return page_address(page);
 }
 
@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
 
-	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-			    -(1 << order));
+	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
 }
 
@@ -550,8 +550,8 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else {
 		unsigned int order = compound_order(sp);
-		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(sp, order);
 
 	}
diff --git a/mm/slub.c b/mm/slub.c
index 249e4c8be66a..bd902d65a71c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3838,8 +3838,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	page = alloc_pages_node(node, flags, order);
 	if (page) {
 		ptr = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 
 	return kmalloc_large_node_hook(ptr, size, flags);
@@ -3970,8 +3970,8 @@ void kfree(const void *x)
 
 		BUG_ON(!PageCompound(page));
 		kfree_hook(object);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(page, order);
 		return;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 50c81463797b..990e4b455b7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4272,7 +4272,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * unmapped file backed pages.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+	    pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3375e7f45891..1d10ffcecb9f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -344,6 +344,8 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
+	if (vmstat_item_in_bytes(item))
+		t <<= PAGE_SHIFT;
 
 	if (unlikely(x > t || x < -t)) {
 		node_page_state_add(x, pgdat, item);
@@ -555,6 +557,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 		 * for all cpus in a node.
 		 */
 		t = this_cpu_read(pcp->stat_threshold);
+		if (vmstat_item_in_bytes(item))
+			t <<= PAGE_SHIFT;
 
 		o = this_cpu_read(*p);
 		n = delta + o;
@@ -999,6 +1003,12 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 #endif
 	return x;
 }
+
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+				    enum node_stat_item item)
+{
+	return node_page_state(pgdat, item) >> PAGE_SHIFT;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1548,8 +1558,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	if (is_zone_first_populated(pgdat, zone)) {
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			unsigned long x = node_page_state(pgdat, i);
+
+			if (vmstat_item_in_bytes(i))
+				x >>= PAGE_SHIFT;
 			seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
-				   node_page_state(pgdat, i));
+				   x);
 		}
 	}
 	seq_printf(m,
@@ -1671,8 +1685,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	v += NR_VM_NUMA_STAT_ITEMS;
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 		v[i] = global_node_page_state(i);
+		if (vmstat_item_in_bytes(i))
+			v[i] >>= PAGE_SHIFT;
+	}
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
diff --git a/mm/workingset.c b/mm/workingset.c
index c963831d354f..675792387e58 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -430,8 +430,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.21.0



* [PATCH 04/16] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (2 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 03/16] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 05/16] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Allocate and release memory for storing the memcg ownership data. For
each slab page allocate space sufficient for number_of_objects pointers
to struct mem_cgroup_ptr (the vector itself is stored in the new
mem_cgroup_vec field).

The mem_cgroup field of the struct page isn't used for slab pages, so
let's use that space to store the pointer to the allocated vector.

This commit makes sure that the space is ready for use, but nobody is
actually using it yet. The following commits in the series will make use
of it.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mm_types.h |  5 ++++-
 mm/slab.c                |  3 ++-
 mm/slab.h                | 37 ++++++++++++++++++++++++++++++++++++-
 mm/slub.c                |  2 +-
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2222fa795284..4d99ee5a9c53 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct mem_cgroup_ptr **mem_cgroup_vec;
+	};
 #endif
 
 	/*
diff --git a/mm/slab.c b/mm/slab.c
index f1e1840af533..ffa16dd966ef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,7 +1370,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+			     cachep->num)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 03833b02b9ae..8620a0a1d5fa 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -406,6 +406,23 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
 
+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+					     unsigned int objects)
+{
+	page->mem_cgroup_vec = kmalloc(sizeof(struct mem_cgroup_ptr *) *
+				       objects, gfp | __GFP_ZERO);
+	if (!page->mem_cgroup_vec)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+	kfree(page->mem_cgroup_vec);
+	page->mem_cgroup_vec = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +472,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+					     unsigned int objects)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -479,14 +506,21 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 
 static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
+					    struct kmem_cache *s,
+					    unsigned int objects)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +533,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_memcg_vec(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index bd902d65a71c..e810582f5b86 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1518,7 +1518,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
+	if (page && charge_slab_page(page, flags, order, s, oo_objects(oo))) {
 		__free_pages(page, order);
 		page = NULL;
 	}
-- 
2.21.0



* [PATCH 05/16] mm: slub: implement SLUB version of obj_to_index()
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (3 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 04/16] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 06/16] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

This commit implements a SLUB version of the obj_to_index() function,
which will be required to calculate the offset of a memcg_ptr in the
mem_cgroup_vec in order to store/obtain the memcg ownership data.

To make it fast, let's repeat SLAB's trick introduced by commit
6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a divide in
obj_to_index()") and avoid an expensive division.
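
For reference, the trick boils down to precomputing a reciprocal of the
object size once per cache and replacing the per-object division with a
multiply and a shift. Roughly (an illustrative sketch, not the exact
code from this patch):

#include <linux/reciprocal_div.h>

/* computed once, when the cache geometry is known */
static struct reciprocal_value recip;

static void cache_geometry_init(unsigned int obj_size)
{
        recip = reciprocal_value(obj_size);     /* e.g. 256 */
}

/* hot path: equivalent to offset / obj_size, but without a division */
static unsigned int obj_index(unsigned int offset)
{
        return reciprocal_divide(offset, recip);
}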

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slub_def.h | 9 +++++++++
 mm/slub.c                | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..200ea292f250 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,11 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - page_address(page),
+				 cache->reciprocal_size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index e810582f5b86..557ea45a5d75 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3600,6 +3600,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-- 
2.21.0



* [PATCH 06/16] mm: memcg/slab: save memcg ownership data for non-root slab objects
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (4 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 05/16] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 07/16] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Store a memcg_ptr in the corresponding slot of the mem_cgroup_vec for
each allocated non-root slab object. Make sure that each allocated
object holds a reference to the mem_cgroup_ptr.

To get the memcg_ptr in the post-alloc hook, we need a memcg pointer.
Because all memory cgroups will soon share the same set of kmem_caches,
let's not use kmem_cache->memcg_params.memcg. Instead, let's pass the
pointer directly from memcg_kmem_get_cache(). This guarantees that we
use the same cgroup in the pre- and post-alloc hooks.

Please note that the code is a bit bulky now, because we have to manage
three types of objects with reference counters: memcg, kmem_cache and
memcg_ptr. The following commits in the series will simplify it.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            |  8 +++--
 mm/slab.c                  | 18 ++++++-----
 mm/slab.h                  | 64 ++++++++++++++++++++++++++++++++++----
 mm/slub.c                  | 14 ++++++---
 5 files changed, 86 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index da864fded297..f4cb844005a5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1397,7 +1397,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct mem_cgroup **memcgp);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9303e98b0718..47a30db94869 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3023,7 +3023,8 @@ static inline bool memcg_kmem_bypass(void)
  * done with it, memcg_kmem_put_cache() must be called to release the
  * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -3079,8 +3080,11 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 */
 	if (unlikely(!memcg_cachep))
 		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
+		css_get(&memcg->css);
+		*memcgp = memcg;
 		cachep = memcg_cachep;
+	}
 out_unlock:
 	rcu_read_unlock();
 	return cachep;
diff --git a/mm/slab.c b/mm/slab.c
index ffa16dd966ef..91cd8bc4ee07 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3223,9 +3223,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
+	struct mem_cgroup *memcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3261,7 +3262,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, memcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3302,9 +3303,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
+	struct mem_cgroup *memcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3318,7 +3320,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, memcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3440,6 +3442,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
+	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
@@ -3505,8 +3508,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
+	struct mem_cgroup *memcg = NULL;
 
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &memcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3529,13 +3533,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, memcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, memcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index 8620a0a1d5fa..28feabed1e9a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -423,6 +423,45 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 	page->mem_cgroup_vec = NULL;
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct mem_cgroup *memcg,
+					      size_t size, void **p)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+	struct page *page;
+	unsigned long off;
+	size_t i;
+
+	memcg_ptr = mem_cgroup_get_kmem_ptr(memcg);
+	for (i = 0; i < size; i++) {
+		if (likely(p[i])) {
+			page = virt_to_head_page(p[i]);
+			off = obj_to_index(s, page, p[i]);
+			mem_cgroup_ptr_get(memcg_ptr);
+			page->mem_cgroup_vec[off] = memcg_ptr;
+		}
+	}
+	mem_cgroup_ptr_put(memcg_ptr);
+	mem_cgroup_put(memcg);
+
+	memcg_kmem_put_cache(s);
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+	unsigned int off;
+
+	if (!memcg_kmem_enabled() || is_root_cache(s))
+		return;
+
+	off = obj_to_index(s, page, p);
+	memcg_ptr = page->mem_cgroup_vec[off];
+	page->mem_cgroup_vec[off] = NULL;
+	mem_cgroup_ptr_put(memcg_ptr);
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -482,6 +521,17 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 {
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct mem_cgroup *memcg,
+					      size_t size, void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -591,7 +641,8 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     gfp_t flags)
+						     struct mem_cgroup **memcgp,
+						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
 
@@ -605,13 +656,14 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s);
+		return memcg_kmem_get_cache(s, memcgp);
 
 	return s;
 }
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-					size_t size, void **p)
+static inline void slab_post_alloc_hook(struct kmem_cache *s,
+					struct mem_cgroup *memcg,
+					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
 
@@ -623,8 +675,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_kmem_put_cache(s);
+	if (!is_root_cache(s))
+		memcg_slab_post_alloc_hook(s, memcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index 557ea45a5d75..a62545c7acac 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2700,8 +2700,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
+	struct mem_cgroup *memcg = NULL;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
+	s = slab_pre_alloc_hook(s, &memcg, 1, gfpflags);
 	if (!s)
 		return NULL;
 redo:
@@ -2777,7 +2778,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
-	slab_post_alloc_hook(s, gfpflags, 1, &object);
+	slab_post_alloc_hook(s, memcg, gfpflags, 1, &object);
 
 	return object;
 }
@@ -2982,6 +2983,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
+
+	memcg_slab_free_hook(s, page, head);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3159,9 +3162,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
+	struct mem_cgroup *memcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &memcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3206,11 +3210,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, memcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, memcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 07/16] mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (5 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 06/16] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 08/16] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function is small
and fits nicely as a static inline helper.

It will be used by the slab code.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 7 +++++++
 mm/memcontrol.c            | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f4cb844005a5..7a6713dc52ce 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1432,6 +1432,13 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+	return false;
+}
+
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 {
 	if (memcg_kmem_enabled())
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 47a30db94869..d9d5f77a783d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3000,13 +3000,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	queue_work(memcg_kmem_cache_wq, &cw->work);
 }
 
-static inline bool memcg_kmem_bypass(void)
-{
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
-		return true;
-	return false;
-}
-
 /**
  * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/16] mm: memcg: introduce __mod_lruvec_memcg_state()
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (6 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 07/16] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

To prepare for per-object accounting of slab objects, let's introduce
the __mod_lruvec_memcg_state() and mod_lruvec_memcg_state() helpers,
which are similar to mod_lruvec_state(), but update only the per-cgroup
and per-lruvec counters, not the global node counters.

This is necessary because the node slab counters will soon be used to
account all memory used by slab pages, while at the memcg level only
the memory actually used by slab objects will be counted. Free space
will be shared between all cgroups, so it can't be accounted to any
of them.
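
To illustrate the split, here is a tiny userspace model (not kernel
code; all names and sizes below are made up):

#include <stdio.h>

#define PAGE_SIZE 4096

static long node_slab_bytes;	/* per-node counter: whole slab pages */
static long memcg_slab_bytes;	/* per-memcg counter: used objects only */

static void charge_slab_page(int order)
{
	/* the node-level counter accounts the entire slab page */
	node_slab_bytes += PAGE_SIZE << order;
}

static void charge_object(long obj_size)
{
	/* the memcg-level counter accounts only the allocated object */
	memcg_slab_bytes += obj_size;
}

int main(void)
{
	charge_slab_page(0);	/* one 4k slab page */
	charge_object(256);	/* two 256-byte objects handed out */
	charge_object(256);

	printf("node:  %ld bytes\n", node_slab_bytes);	/* 4096 */
	printf("memcg: %ld bytes\n", memcg_slab_bytes);	/* 512 */
	return 0;
}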

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 22 ++++++++++++++++++++++
 mm/memcontrol.c            | 37 +++++++++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a6713dc52ce..bf1fcf85abd8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -738,6 +738,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -750,6 +752,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	local_irq_restore(flags);
 }
 
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_lruvec_memcg_state(lruvec, idx, val);
+	local_irq_restore(flags);
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
@@ -1142,6 +1154,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 }
 
+static inline void __mod_lruvec_memcg_state(struct lruvec *lruvec,
+					    enum node_stat_item idx, int val)
+{
+}
+
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d9d5f77a783d..ebaa4ff36e0c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -787,16 +787,16 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 }
 
 /**
- * __mod_lruvec_state - update lruvec memory statistics
+ * __mod_lruvec_memcg_state - update lruvec memory statistics
  * @lruvec: the lruvec
  * @idx: the stat item
  * @val: delta to add to the counter, can be negative
  *
  * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
+ * function updates the two of three counters that are affected by a
+ * change of state at this level: per-cgroup and per-lruvec.
  */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val)
 {
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
@@ -804,12 +804,6 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
-	/* Update node */
-	__mod_node_page_state(pgdat, idx, val);
-
-	if (mem_cgroup_disabled())
-		return;
-
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
@@ -833,6 +827,29 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			int val)
+{
+	pg_data_t *pgdat = lruvec_pgdat(lruvec);
+
+	/* Update node */
+	__mod_node_page_state(pgdat, idx, val);
+
+	/* Update per-cgroup and per-lruvec stats */
+	if (!mem_cgroup_disabled())
+		__mod_lruvec_memcg_state(lruvec, idx, val);
+}
+
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
 	struct page *page = virt_to_head_page(p);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (7 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 08/16] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-25 19:41   ` Johannes Weiner
  2019-10-18  0:28 ` [PATCH 10/16] mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h Roman Gushchin
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Switch to per-object accounting of non-root slab objects.

Charging is performed using the subpage charging API in the pre_alloc
hook. If the amount of memory has been charged successfully, we proceed
with the actual allocation. Otherwise, -ENOMEM is returned.

In the post_alloc hook we check whether the actual allocation succeeded.
If so, the corresponding vmstats are bumped and the memcg membership
information is recorded. Otherwise, the charge is canceled.

On the free path we look up the memcg membership information,
decrement the stats and uncharge. No operations are performed
on root kmem_caches.

The global per-node slab-related vmstats NR_SLAB_(UN)RECLAIMABLE_B
are still modified from the (un)charge_slab_page() functions. The idea
is to keep all slab pages accounted as slab pages at the system level.
Memcg and lruvec counters now represent only the memory used by actual
slab objects and do not include free space. Free space is shared and
doesn't belong to any specific cgroup.
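
A minimal userspace sketch of that flow (not the actual kernel hooks;
the names and sizes below are made up):

#include <stdbool.h>
#include <stdio.h>

struct memcg { long charged; };	/* stand-in for a memory cgroup */

/* pre_alloc: charge up front for 'size' objects of 'obj_size' bytes */
static bool pre_alloc_charge(struct memcg *mc, int size, long obj_size)
{
	mc->charged += (long)size * obj_size;
	return true;	/* the real hook can fail and lead to -ENOMEM */
}

/* post_alloc: cancel the charge for slots that were not filled */
static void post_alloc_fixup(struct memcg *mc, void **p, int size, long obj_size)
{
	for (int i = 0; i < size; i++)
		if (!p[i])
			mc->charged -= obj_size;
}

/* free path: uncharge a single object */
static void free_uncharge(struct memcg *mc, long obj_size)
{
	mc->charged -= obj_size;
}

int main(void)
{
	struct memcg mc = { 0 };
	void *objs[2] = { &mc, NULL };	/* pretend the 2nd allocation failed */

	pre_alloc_charge(&mc, 2, 128);
	post_alloc_fixup(&mc, objs, 2, 128);
	printf("charged after alloc: %ld\n", mc.charged);	/* 128 */

	free_uncharge(&mc, 128);
	printf("charged after free:  %ld\n", mc.charged);	/* 0 */
	return 0;
}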

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.h | 152 ++++++++++++++++++++----------------------------------
 1 file changed, 57 insertions(+), 95 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 28feabed1e9a..0f2f712de77a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -340,72 +340,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-/*
- * Charge the slab page belonging to the non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline int memcg_charge_slab(struct page *page,
-					     gfp_t gfp, int order,
-					     struct kmem_cache *s)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-	int ret;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	while (memcg && !css_tryget_online(&memcg->css))
-		memcg = parent_mem_cgroup(memcg);
-	rcu_read_unlock();
-
-	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    (PAGE_SIZE << order));
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-		return 0;
-	}
-
-	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
-	if (ret)
-		goto out;
-
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), PAGE_SIZE << order);
-
-	/* transer try_charge() page references to kmem_cache */
-	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-	css_put_many(&memcg->css, 1 << order);
-out:
-	css_put(&memcg->css);
-	return ret;
-}
-
-/*
- * Uncharge a slab page belonging to a non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline void memcg_uncharge_slab(struct page *page, int order,
-						struct kmem_cache *s)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	if (likely(!mem_cgroup_is_root(memcg))) {
-		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
-				 -(PAGE_SIZE << order));
-		memcg_kmem_uncharge_memcg(page, order, memcg);
-	} else {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(PAGE_SIZE << order));
-	}
-	rcu_read_unlock();
-
-	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-}
-
 static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
 					     unsigned int objects)
 {
@@ -423,11 +357,31 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 	page->mem_cgroup_vec = NULL;
 }
 
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct mem_cgroup **memcgp,
+						size_t size, gfp_t flags)
+{
+	struct kmem_cache *cachep;
+
+	cachep = memcg_kmem_get_cache(s, memcgp);
+	if (is_root_cache(cachep))
+		return s;
+
+	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
+		mem_cgroup_put(*memcgp);
+		memcg_kmem_put_cache(cachep);
+		cachep = NULL;
+	}
+
+	return cachep;
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct mem_cgroup *memcg,
 					      size_t size, void **p)
 {
 	struct mem_cgroup_ptr *memcg_ptr;
+	struct lruvec *lruvec;
 	struct page *page;
 	unsigned long off;
 	size_t i;
@@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			mem_cgroup_ptr_get(memcg_ptr);
 			page->mem_cgroup_vec[off] = memcg_ptr;
+			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
+					       s->size);
+		} else {
+			__memcg_kmem_uncharge_subpage(memcg, s->size);
 		}
 	}
 	mem_cgroup_ptr_put(memcg_ptr);
@@ -451,6 +410,8 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 					void *p)
 {
 	struct mem_cgroup_ptr *memcg_ptr;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 	unsigned int off;
 
 	if (!memcg_kmem_enabled() || is_root_cache(s))
@@ -459,6 +420,14 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	off = obj_to_index(s, page, p);
 	memcg_ptr = page->mem_cgroup_vec[off];
 	page->mem_cgroup_vec[off] = NULL;
+	rcu_read_lock();
+	memcg = memcg_ptr->memcg;
+	if (likely(!mem_cgroup_is_root(memcg))) {
+		__memcg_kmem_uncharge_subpage(memcg, s->size);
+		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+		mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s), -s->size);
+	}
+	rcu_read_unlock();
 	mem_cgroup_ptr_put(memcg_ptr);
 }
 
@@ -500,17 +469,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
-				    struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void memcg_uncharge_slab(struct page *page, int order,
-				       struct kmem_cache *s)
-{
-}
-
 static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
 					     unsigned int objects)
 {
@@ -521,6 +479,13 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 {
 }
 
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct mem_cgroup **memcgp,
+						size_t size, gfp_t flags)
+{
+	return NULL;
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct mem_cgroup *memcg,
 					      size_t size, void **p)
@@ -561,30 +526,27 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	int ret;
 
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    PAGE_SIZE << order);
-		return 0;
-	}
-
-	ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
-	if (ret)
-		return ret;
+	if (!is_root_cache(s)) {
+		ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
+		if (ret)
+			return ret;
 
-	return memcg_charge_slab(page, gfp, order, s);
+		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	}
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    PAGE_SIZE << order);
+	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(PAGE_SIZE << order));
-		return;
+	if (!is_root_cache(s)) {
+		memcg_free_page_memcg_vec(page);
+		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 	}
-
-	memcg_free_page_memcg_vec(page);
-	memcg_uncharge_slab(page, order, s);
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    -(PAGE_SIZE << order));
 }
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
@@ -656,7 +618,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s, memcgp);
+		return memcg_slab_pre_alloc_hook(s, memcgp, size, flags);
 
 	return s;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/16] mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (8 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 11/16] mm: memcg/slab: replace memcg_from_slab_page() with memcg_from_slab_obj() Roman Gushchin
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Move get_mem_cgroup_from_current() to memcontrol.h to make it
available for use outside of memcontrol.c. The function is
small and marked as __always_inline, so it's better to place
it in memcontrol.h.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 17 +++++++++++++++++
 mm/memcontrol.c            | 17 -----------------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf1fcf85abd8..a75980559a47 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -453,6 +453,23 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+/**
+ * If current->active_memcg is non-NULL, do not fallback to current->mm->memcg.
+ */
+static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
+{
+	if (unlikely(current->active_memcg)) {
+		struct mem_cgroup *memcg = root_mem_cgroup;
+
+		rcu_read_lock();
+		if (css_tryget_online(&current->active_memcg->css))
+			memcg = current->active_memcg;
+		rcu_read_unlock();
+		return memcg;
+	}
+	return get_mem_cgroup_from_mm(current->mm);
+}
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebaa4ff36e0c..8a753a336efd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1074,23 +1074,6 @@ struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 }
 EXPORT_SYMBOL(get_mem_cgroup_from_page);
 
-/**
- * If current->active_memcg is non-NULL, do not fallback to current->mm->memcg.
- */
-static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
-{
-	if (unlikely(current->active_memcg)) {
-		struct mem_cgroup *memcg = root_mem_cgroup;
-
-		rcu_read_lock();
-		if (css_tryget_online(&current->active_memcg->css))
-			memcg = current->active_memcg;
-		rcu_read_unlock();
-		return memcg;
-	}
-	return get_mem_cgroup_from_mm(current->mm);
-}
-
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/16] mm: memcg/slab: replace memcg_from_slab_page() with memcg_from_slab_obj()
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (9 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 10/16] mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 13/16] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

On the way to sharing slab pages between multiple memory cgroups,
let's make sure we don't use the kmem_cache.memcg_params.memcg
pointer to determine memcg ownership of a slab object/page.

Let's transform memcg_from_slab_page() into memcg_from_slab_obj(),
which relies on memcg ownership data stored in page->mem_cgroup_vec.

Delete mem_cgroup_from_kmem() and use memcg_from_slab_obj()
instead.

Note: memcg_from_slab_obj() returns NULL if the slab object belongs
to the root cgroup, so the redundant check in
__mod_lruvec_slab_state() can be removed.
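
A small userspace model of that lookup (illustrative only; the
structures below are simplified stand-ins for the real struct page and
mem_cgroup_ptr):

#include <stdint.h>
#include <stdio.h>

#define OBJ_SIZE	64
#define NR_OBJECTS	8

struct memcg { const char *name; };

struct slab_page {
	char objects[NR_OBJECTS][OBJ_SIZE];
	struct memcg *mem_cgroup_vec[NR_OBJECTS];	/* one entry per object */
};

static unsigned int obj_to_index(struct slab_page *page, void *obj)
{
	return ((uintptr_t)obj - (uintptr_t)page->objects) / OBJ_SIZE;
}

static struct memcg *memcg_from_slab_obj(struct slab_page *page, void *obj)
{
	return page->mem_cgroup_vec[obj_to_index(page, obj)];
}

int main(void)
{
	static struct slab_page page;
	static struct memcg mc = { "workload" };
	void *obj = &page.objects[3];

	page.mem_cgroup_vec[3] = &mc;	/* recorded at allocation time */
	printf("object %p belongs to %s\n", obj,
	       memcg_from_slab_obj(&page, obj)->name);
	return 0;
}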

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/list_lru.c   | 12 +-----------
 mm/memcontrol.c |  9 ++++-----
 mm/slab.h       | 21 +++++++++++++--------
 3 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0f1f6b06b7f3..4f9d791b802c 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -57,16 +57,6 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 	return &nlru->lru;
 }
 
-static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
-{
-	struct page *page;
-
-	if (!memcg_kmem_enabled())
-		return NULL;
-	page = virt_to_head_page(ptr);
-	return memcg_from_slab_page(page);
-}
-
 static inline struct list_lru_one *
 list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 		   struct mem_cgroup **memcg_ptr)
@@ -77,7 +67,7 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 	if (!nlru->memcg_lrus)
 		goto out;
 
-	memcg = mem_cgroup_from_kmem(ptr);
+	memcg = memcg_from_slab_obj(ptr);
 	if (!memcg)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a753a336efd..1982b14d6e6f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,7 +548,7 @@ ino_t page_cgroup_ino(struct page *page)
 
 	rcu_read_lock();
 	if (PageHead(page) && PageSlab(page))
-		memcg = memcg_from_slab_page(page);
+		memcg = root_mem_cgroup;
 	else
 		memcg = READ_ONCE(page->mem_cgroup);
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
@@ -852,16 +852,15 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
-	struct page *page = virt_to_head_page(p);
-	pg_data_t *pgdat = page_pgdat(page);
+	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	rcu_read_lock();
-	memcg = memcg_from_slab_page(page);
+	memcg = memcg_from_slab_obj(p);
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
-	if (!memcg || memcg == root_mem_cgroup) {
+	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
 	} else {
 		lruvec = mem_cgroup_lruvec(pgdat, memcg);
diff --git a/mm/slab.h b/mm/slab.h
index 0f2f712de77a..a6330065d434 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -329,15 +329,20 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
  * The kmem_cache can be reparented asynchronously. The caller must ensure
  * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
  */
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
-	struct kmem_cache *s;
-
-	s = READ_ONCE(page->slab_cache);
-	if (s && !is_root_cache(s))
-		return READ_ONCE(s->memcg_params.memcg);
+	struct mem_cgroup_ptr *memcg_ptr;
+	struct page *page;
+	unsigned int off;
 
-	return NULL;
+	if (!memcg_kmem_enabled())
+		return NULL;
+	page = virt_to_head_page(ptr);
+	if (is_root_cache(page->slab_cache))
+		return NULL;
+	off = obj_to_index(page->slab_cache, page, ptr);
+	memcg_ptr = page->mem_cgroup_vec[off];
+	return memcg_ptr->memcg;
 }
 
 static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
@@ -464,7 +469,7 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 13/16] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (10 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 11/16] mm: memcg/slab: replace memcg_from_slab_page() with memcg_from_slab_obj() Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 14/16] mm: memcg/slab: use one set of kmem_caches for all memory cgroups Roman Gushchin
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Deprecate memory.kmem.slabinfo.

An empty file will be presented if the corresponding config options
are enabled.

The interface is implementation-dependent, isn't present in cgroup v2,
and is generally useful only for core mm debugging purposes. In other
words, it doesn't provide any value for the vast majority of users.

A drgn-based replacement can be found in tools/cgroup/slabinfo.py.
It supports both cgroup v1 and v2, mimics the memory.kmem.slabinfo
output, and also makes it possible to get additional information
without recompiling the kernel.

If the drgn-based solution is too slow for a given task, a bpf-based
tracing tool can be used instead; it can easily keep track of all slab
allocations belonging to a memory cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c  |  3 ---
 mm/slab_common.c | 31 ++++---------------------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1982b14d6e6f..0c9698f03cfe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5038,9 +5038,6 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = memcg_slab_start,
-		.seq_next = memcg_slab_next,
-		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f0f7f955c5fa..cc0c70b57c1c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1509,35 +1509,12 @@ void dump_unreclaimable_slab(void)
 }
 
 #if defined(CONFIG_MEMCG)
-void *memcg_slab_start(struct seq_file *m, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	mutex_lock(&slab_mutex);
-	return seq_list_start(&memcg->kmem_caches, *pos);
-}
-
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	return seq_list_next(p, &memcg->kmem_caches, pos);
-}
-
-void memcg_slab_stop(struct seq_file *m, void *p)
-{
-	mutex_unlock(&slab_mutex);
-}
-
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache,
-					  memcg_params.kmem_caches_node);
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	if (p == memcg->kmem_caches.next)
-		print_slabinfo_header(m);
-	cache_show(s, m);
+	/*
+	 * Deprecated.
+	 * Please, take a look at tools/cgroup/slabinfo.py .
+	 */
 	return 0;
 }
 #endif
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 14/16] mm: memcg/slab: use one set of kmem_caches for all memory cgroups
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (11 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 13/16] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 15/16] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

This is a fairly big but mostly red patch, which makes all non-root
slab allocations use a single set of kmem_caches instead of
creating a separate set for each memory cgroup.

Because the number of non-root kmem_caches is now capped by the number
of root kmem_caches, there is no need to shrink or destroy them
prematurely. They can simply be destroyed together with their
root counterparts. This allows us to dramatically simplify the
management of non-root kmem_caches and delete a ton of code.

This patch performs the following changes:
1) introduces the memcg_params.memcg_cache pointer to represent the
   kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
   to create the memcg kmem_cache on the first allocation attempt
3) memcg kmem_caches are named <kmemcache_name>-memcg,
   e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return the memcg
   kmem_cache or schedule its creation and return the root cache
   (a sketch of this lookup follows the list)
5) removes almost all non-root kmem_cache management code
   (separate refcounter, reparenting, shrinking, etc)
6) makes slab debugfs display the root_mem_cgroup css id and never
   show the :dead and :deact flags in the memcg_slabinfo attribute.
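
A userspace sketch of the lookup from point 4 (not the kernel
implementation; in the real patch the memcg cache is created
asynchronously from a workqueue, and the names below are made up):

#include <stdio.h>

struct kmem_cache {
	const char *name;
	struct kmem_cache *memcg_cache;	/* single shared memcg cache */
};

static void schedule_memcg_cache_create(struct kmem_cache *root)
{
	/* the real code queues a work item; the sketch creates it inline */
	static struct kmem_cache memcg_variant = { "dentry-memcg", NULL };
	root->memcg_cache = &memcg_variant;
}

static struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *root)
{
	if (!root->memcg_cache) {
		schedule_memcg_cache_create(root);
		return root;	/* early allocations fall back to the root cache */
	}
	return root->memcg_cache;
}

int main(void)
{
	struct kmem_cache root = { "dentry", NULL };

	printf("first:  %s\n", memcg_kmem_get_cache(&root)->name);	/* dentry */
	printf("second: %s\n", memcg_kmem_get_cache(&root)->name);	/* dentry-memcg */
	return 0;
}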

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   5 +-
 include/linux/slab.h       |   3 +-
 mm/memcontrol.c            | 123 ++---------
 mm/slab.c                  |  16 +-
 mm/slab.h                  | 121 ++++-------
 mm/slab_common.c           | 414 ++++---------------------------------
 mm/slub.c                  |  38 +---
 7 files changed, 106 insertions(+), 614 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a75980559a47..f36203cf75f8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -327,7 +327,6 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct list_head kmem_caches;
 	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
 	struct list_head kmem_memcg_ptr_list;
 #endif
@@ -1436,9 +1435,7 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct mem_cgroup **memcgp);
-void memcg_kmem_put_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 877a95c6a2d2..246474e9c706 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
+void memcg_create_kmem_cache(struct kmem_cache *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c9698f03cfe..b0d0c833150c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -330,7 +330,7 @@ static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
 }
 
 /*
- * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
+ * This will be used as a shrinker list's index.
  * The main reason for not using cgroup id for this:
  *  this works better in sparse environments, where we have a lot of memcgs,
  *  but only a few kmem-limited. Or also, if we have, for instance, 200
@@ -2938,9 +2938,7 @@ static int memcg_alloc_cache_id(void)
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
-	err = memcg_update_all_caches(size);
-	if (!err)
-		err = memcg_update_all_list_lrus(size);
+	err = memcg_update_all_list_lrus(size);
 	if (!err)
 		memcg_nr_cache_ids = size;
 
@@ -2959,7 +2957,6 @@ static void memcg_free_cache_id(int id)
 }
 
 struct memcg_kmem_cache_create_work {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *cachep;
 	struct work_struct work;
 };
@@ -2968,31 +2965,24 @@ static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
 		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct mem_cgroup *memcg = cw->memcg;
 	struct kmem_cache *cachep = cw->cachep;
 
-	memcg_create_kmem_cache(memcg, cachep);
+	memcg_create_kmem_cache(cachep);
 
-	css_put(&memcg->css);
 	kfree(cw);
 }
 
 /*
  * Enqueue the creation of a per-memcg kmem_cache.
  */
-static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
-					       struct kmem_cache *cachep)
+static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
 {
 	struct memcg_kmem_cache_create_work *cw;
 
-	if (!css_tryget_online(&memcg->css))
-		return;
-
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
@@ -3000,96 +2990,26 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 }
 
 /**
- * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
  *
  * Return the kmem_cache we're supposed to use for a slab allocation.
- * We try to use the current memcg's version of the cache.
  *
  * If the cache does not exist yet, if we are the first user of it, we
  * create it asynchronously in a workqueue and let the current allocation
  * go through with the original cache.
- *
- * This function takes a reference to the cache it returns to assure it
- * won't get destroyed while we are working with it. Once the caller is
- * done with it, memcg_kmem_put_cache() must be called to release the
- * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct mem_cgroup **memcgp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
-	struct memcg_cache_array *arr;
-	int kmemcg_id;
 
-	VM_BUG_ON(!is_root_cache(cachep));
-
-	if (memcg_kmem_bypass())
+	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
+	if (unlikely(!memcg_cachep)) {
+		memcg_schedule_kmem_cache_create(cachep);
 		return cachep;
-
-	rcu_read_lock();
-
-	if (unlikely(current->active_memcg))
-		memcg = current->active_memcg;
-	else
-		memcg = mem_cgroup_from_task(current);
-
-	if (!memcg || memcg == root_mem_cgroup)
-		goto out_unlock;
-
-	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
-	if (kmemcg_id < 0)
-		goto out_unlock;
-
-	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match the data dependency
-	 * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
-	 */
-	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
-
-	/*
-	 * If we are in a safe context (can wait, and not in interrupt
-	 * context), we could be be predictable and return right away.
-	 * This would guarantee that the allocation being performed
-	 * already belongs in the new cache.
-	 *
-	 * However, there are some clashes that can arrive from locking.
-	 * For instance, because we acquire the slab_mutex while doing
-	 * memcg_create_kmem_cache, this means no further allocation
-	 * could happen with the slab_mutex held. So it's better to
-	 * defer everything.
-	 *
-	 * If the memcg is dying or memcg_cache is about to be released,
-	 * don't bother creating new kmem_caches. Because memcg_cachep
-	 * is ZEROed as the fist step of kmem offlining, we don't need
-	 * percpu_ref_tryget_live() here. css_tryget_online() check in
-	 * memcg_schedule_kmem_cache_create() will prevent us from
-	 * creation of a new kmem_cache.
-	 */
-	if (unlikely(!memcg_cachep))
-		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
-		css_get(&memcg->css);
-		*memcgp = memcg;
-		cachep = memcg_cachep;
 	}
-out_unlock:
-	rcu_read_unlock();
-	return cachep;
-}
 
-/**
- * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
- * @cachep: the cache returned by memcg_kmem_get_cache
- */
-void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (!is_root_cache(cachep))
-		percpu_ref_put(&cachep->memcg_params.refcnt);
+	return memcg_cachep;
 }
 
 /**
@@ -3669,7 +3589,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
-	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -3682,12 +3601,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	if (memcg->kmem_state != KMEM_ONLINE)
 		return;
-	/*
-	 * Clear the online state before clearing memcg_caches array
-	 * entries. The slab_mutex in memcg_deactivate_kmem_caches()
-	 * guarantees that no cache will be created for this cgroup
-	 * after we are done (see memcg_create_kmem_cache()).
-	 */
+
 	memcg->kmem_state = KMEM_ALLOCATED;
 
 	parent = parent_mem_cgroup(memcg);
@@ -3695,12 +3609,10 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches and reparent kmem_memcg_ptr.
-	 * Then flush percpu slab statistics to have precise values at the
-	 * parent and all ancestor levels. It's required to keep slab stats
-	 * accurate after the reparenting of kmem_caches.
+	 * Reparent kmem_memcg_ptr. Then flush percpu slab statistics to have
+	 * precise values at the parent and all ancestor levels. It's required
+	 * to keep slab stats accurate after the reparenting of kmem_caches.
 	 */
-	memcg_deactivate_kmem_caches(memcg, parent);
 	memcg_reparent_kmem_memcg_ptr(memcg, parent);
 	memcg_flush_percpu_vmstats(memcg, true);
 
@@ -3736,10 +3648,8 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
 		memcg_offline_kmem(memcg);
 
-	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		WARN_ON(!list_empty(&memcg->kmem_caches));
+	if (memcg->kmem_state == KMEM_ALLOCATED)
 		static_branch_dec(&memcg_kmem_enabled_key);
-	}
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -5316,9 +5226,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
-#ifdef CONFIG_MEMCG_KMEM
-		INIT_LIST_HEAD(&memcg->kmem_caches);
-#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 91cd8bc4ee07..0914d7cd869f 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache, NULL);
+	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
@@ -2244,17 +2244,6 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
-{
-	__kmem_cache_shrink(cachep);
-}
-
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-}
-#endif
-
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	return __kmem_cache_shrink(cachep);
@@ -3862,7 +3851,8 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 		return ret;
 
 	lockdep_assert_held(&slab_mutex);
-	for_each_memcg_cache(c, cachep) {
+	c = memcg_cache(cachep);
+	if (c) {
 		/* return value determined by the root cache only */
 		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index a6330065d434..035c2969a2ca 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,66 +32,25 @@ struct kmem_cache {
 
 #else /* !CONFIG_SLOB */
 
-struct memcg_cache_array {
-	struct rcu_head rcu;
-	struct kmem_cache *entries[0];
-};
-
 /*
  * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child caches will have it. For the root cache,
- * this will hold a dynamically allocated array large enough to hold
- * information about the currently limited memcgs in the system. To allow the
- * array to be accessed without taking any locks, on relocation we free the old
- * version only after a grace period.
- *
- * Root and child caches hold different metadata.
+ * Both the root cache and the child cache will have it. Some fields are used
+ * in both cases, other are specific to root caches.
  *
  * @root_cache:	Common to root and child caches.  NULL for root, pointer to
  *		the root cache for children.
  *
  * The following fields are specific to root caches.
  *
- * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
- *		used to index child cachces during allocation and cleared
- *		early during shutdown.
- *
- * @root_caches_node: List node for slab_root_caches list.
- *
- * @children:	List of all child caches.  While the child caches are also
- *		reachable through @memcg_caches, a child cache remains on
- *		this list until it is actually destroyed.
- *
- * The following fields are specific to child caches.
- *
- * @memcg:	Pointer to the memcg this cache belongs to.
- *
- * @children_node: List node for @root_cache->children list.
- *
- * @kmem_caches_node: List node for @memcg->kmem_caches list.
+ * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
+ *		cgroups.
+ * @root_caches_node: list node for slab_root_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
-	union {
-		struct {
-			struct memcg_cache_array __rcu *memcg_caches;
-			struct list_head __root_caches_node;
-			struct list_head children;
-			bool dying;
-		};
-		struct {
-			struct mem_cgroup *memcg;
-			struct list_head children_node;
-			struct list_head kmem_caches_node;
-			struct percpu_ref refcnt;
-
-			void (*work_fn)(struct kmem_cache *);
-			union {
-				struct rcu_head rcu_head;
-				struct work_struct work;
-			};
-		};
-	};
+
+	struct kmem_cache *memcg_cache;
+	struct list_head __root_caches_node;
 };
 #endif /* CONFIG_SLOB */
 
@@ -234,8 +193,6 @@ bool __kmem_cache_empty(struct kmem_cache *);
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
-void __kmemcg_cache_deactivate(struct kmem_cache *s);
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 void kmem_cache_shrink_all(struct kmem_cache *s);
 
@@ -281,14 +238,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 extern struct list_head		slab_root_caches;
 #define root_caches_node	memcg_params.__root_caches_node
 
-/*
- * Iterate over all memcg caches of the given root cache. The caller must hold
- * slab_mutex.
- */
-#define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -319,6 +268,13 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return s->memcg_params.memcg_cache;
+	return NULL;
+}
+
 /*
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
@@ -367,17 +323,27 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 						size_t size, gfp_t flags)
 {
 	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	if (memcg_kmem_bypass())
+		return s;
 
-	cachep = memcg_kmem_get_cache(s, memcgp);
-	if (is_root_cache(cachep))
+	memcg = get_mem_cgroup_from_current();
+	if (!memcg || mem_cgroup_is_root(memcg))
 		return s;
 
-	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
-		mem_cgroup_put(*memcgp);
-		memcg_kmem_put_cache(cachep);
-		cachep = NULL;
+	cachep = memcg_kmem_get_cache(s);
+	if (is_root_cache(cachep)) {
+		mem_cgroup_put(memcg);
+		return s;
 	}
 
+	if (__memcg_kmem_charge_subpage(memcg, size * s->size, flags)) {
+		mem_cgroup_put(memcg);
+		return NULL;
+	}
+
+	*memcgp = memcg;
 	return cachep;
 }
 
@@ -407,8 +373,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 	}
 	mem_cgroup_ptr_put(memcg_ptr);
 	mem_cgroup_put(memcg);
-
-	memcg_kmem_put_cache(s);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
@@ -437,7 +401,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
 
@@ -445,9 +409,6 @@ extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 #define slab_root_caches	slab_caches
 #define root_caches_node	list
 
-#define for_each_memcg_cache(iter, root) \
-	for ((void)(iter), (void)(root); 0; )
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -469,6 +430,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
@@ -506,8 +472,7 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s,
-				    struct mem_cgroup *memcg)
+static inline void memcg_link_cache(struct kmem_cache *s)
 {
 }
 
@@ -535,8 +500,6 @@ static __always_inline int charge_slab_page(struct page *page,
 		ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
 		if (ret)
 			return ret;
-
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	}
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
@@ -546,10 +509,9 @@ static __always_inline int charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (!is_root_cache(s)) {
+	if (!is_root_cache(s))
 		memcg_free_page_memcg_vec(page);
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	}
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -698,9 +660,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
-void *memcg_slab_start(struct seq_file *m, loff_t *pos);
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
-void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index cc0c70b57c1c..3dcd90ba7525 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -131,141 +131,36 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
-static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
-	INIT_LIST_HEAD(&s->memcg_params.children);
-	s->memcg_params.dying = false;
+	s->memcg_params.memcg_cache = NULL;
 }
 
-static int init_memcg_params(struct kmem_cache *s,
-			     struct kmem_cache *root_cache)
+static void init_memcg_params(struct kmem_cache *s,
+			      struct kmem_cache *root_cache)
 {
-	struct memcg_cache_array *arr;
-
-	if (root_cache) {
-		int ret = percpu_ref_init(&s->memcg_params.refcnt,
-					  kmemcg_cache_shutdown,
-					  0, GFP_KERNEL);
-		if (ret)
-			return ret;
-
+	if (root_cache)
 		s->memcg_params.root_cache = root_cache;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
-		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
-		return 0;
-	}
-
-	slab_init_memcg_params(s);
-
-	if (!memcg_nr_cache_ids)
-		return 0;
-
-	arr = kvzalloc(sizeof(struct memcg_cache_array) +
-		       memcg_nr_cache_ids * sizeof(void *),
-		       GFP_KERNEL);
-	if (!arr)
-		return -ENOMEM;
-
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
-	return 0;
-}
-
-static void destroy_memcg_params(struct kmem_cache *s)
-{
-	if (is_root_cache(s)) {
-		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-	} else {
-		mem_cgroup_put(s->memcg_params.memcg);
-		WRITE_ONCE(s->memcg_params.memcg, NULL);
-		percpu_ref_exit(&s->memcg_params.refcnt);
-	}
+	else
+		slab_init_memcg_params(s);
 }
 
-static void free_memcg_params(struct rcu_head *rcu)
+void memcg_link_cache(struct kmem_cache *s)
 {
-	struct memcg_cache_array *old;
-
-	old = container_of(rcu, struct memcg_cache_array, rcu);
-	kvfree(old);
-}
-
-static int update_memcg_params(struct kmem_cache *s, int new_array_size)
-{
-	struct memcg_cache_array *old, *new;
-
-	new = kvzalloc(sizeof(struct memcg_cache_array) +
-		       new_array_size * sizeof(void *), GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	old = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	if (old)
-		memcpy(new->entries, old->entries,
-		       memcg_nr_cache_ids * sizeof(void *));
-
-	rcu_assign_pointer(s->memcg_params.memcg_caches, new);
-	if (old)
-		call_rcu(&old->rcu, free_memcg_params);
-	return 0;
-}
-
-int memcg_update_all_caches(int num_memcgs)
-{
-	struct kmem_cache *s;
-	int ret = 0;
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		ret = update_memcg_params(s, num_memcgs);
-		/*
-		 * Instead of freeing the memory, we'll just leave the caches
-		 * up to this point in an updated state.
-		 */
-		if (ret)
-			break;
-	}
-	mutex_unlock(&slab_mutex);
-	return ret;
-}
-
-void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
-{
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_add(&s->root_caches_node, &slab_root_caches);
-	} else {
-		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
-		list_add(&s->memcg_params.children_node,
-			 &s->memcg_params.root_cache->memcg_params.children);
-		list_add(&s->memcg_params.kmem_caches_node,
-			 &s->memcg_params.memcg->kmem_caches);
-	}
 }
 
 static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_del(&s->root_caches_node);
-	} else {
-		list_del(&s->memcg_params.children_node);
-		list_del(&s->memcg_params.kmem_caches_node);
-	}
 }
 #else
-static inline int init_memcg_params(struct kmem_cache *s,
-				    struct kmem_cache *root_cache)
-{
-	return 0;
-}
-
-static inline void destroy_memcg_params(struct kmem_cache *s)
+static inline void init_memcg_params(struct kmem_cache *s,
+				     struct kmem_cache *root_cache)
 {
 }
 
@@ -380,7 +275,7 @@ static struct kmem_cache *create_cache(const char *name,
 		unsigned int object_size, unsigned int align,
 		slab_flags_t flags, unsigned int useroffset,
 		unsigned int usersize, void (*ctor)(void *),
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
@@ -400,24 +295,20 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, root_cache);
-	if (err)
-		goto out_free_cache;
-
+	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, memcg);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
 	return s;
 
 out_free_cache:
-	destroy_memcg_params(s);
 	kmem_cache_free(kmem_cache, s);
 	goto out;
 }
@@ -504,7 +395,7 @@ kmem_cache_create_usercopy(const char *name,
 
 	s = create_cache(cache_name, size,
 			 calculate_alignment(flags, align, size),
-			 flags, useroffset, usersize, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -629,51 +520,27 @@ static int shutdown_cache(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG_KMEM
 /*
- * memcg_create_kmem_cache - Create a cache for a memory cgroup.
- * @memcg: The memory cgroup the new cache is for.
+ * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
  * @root_cache: The parent of the new cache.
  *
  * This function attempts to create a kmem cache that will serve allocation
- * requests going from @memcg to @root_cache. The new cache inherits properties
- * from its parent.
+ * requests going all non-root memory cgroups to @root_cache. The new cache
+ * inherits properties from its parent.
  */
-void memcg_create_kmem_cache(struct mem_cgroup *memcg,
-			     struct kmem_cache *root_cache)
+void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 {
-	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = &memcg->css;
-	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-	int idx;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	/*
-	 * The memory cgroup could have been offlined while the cache
-	 * creation work was pending.
-	 */
-	if (memcg->kmem_state != KMEM_ONLINE)
-		goto out_unlock;
-
-	idx = memcg_cache_id(memcg);
-	arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-
-	/*
-	 * Since per-memcg caches are created asynchronously on first
-	 * allocation (see memcg_kmem_get_cache()), several threads can try to
-	 * create the same cache, but only one of them may succeed.
-	 */
-	if (arr->entries[idx])
+	if (root_cache->memcg_params.memcg_cache)
 		goto out_unlock;
 
-	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
-			       css->serial_nr, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
 	if (!cache_name)
 		goto out_unlock;
 
@@ -681,7 +548,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			 root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
 			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, memcg, root_cache);
+			 root_cache->ctor, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
@@ -698,7 +565,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * initialized.
 	 */
 	smp_wmb();
-	arr->entries[idx] = s;
+	root_cache->memcg_params.memcg_cache = s;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
@@ -707,197 +574,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.work);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	s->memcg_params.work_fn(s);
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static void kmemcg_rcufn(struct rcu_head *head)
-{
-	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.rcu_head);
-
-	/*
-	 * We need to grab blocking locks.  Bounce to ->work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized eariler.
-	 */
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-}
-
-static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
-{
-	WARN_ON(shutdown_cache(s));
-}
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
-{
-	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
-					    memcg_params.refcnt);
-	unsigned long flags;
-
-	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-
-unlock:
-	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
-}
-
-static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	__kmemcg_cache_deactivate_after_rcu(s);
-	percpu_ref_kill(&s->memcg_params.refcnt);
-}
-
-static void kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	if (WARN_ON_ONCE(is_root_cache(s)))
-		return;
-
-	__kmemcg_cache_deactivate(s);
-	s->flags |= SLAB_DEACTIVATED;
-
-	/*
-	 * memcg_kmem_wq_lock is used to synchronize memcg_params.dying
-	 * flag and make sure that no new kmem_cache deactivation tasks
-	 * are queued (see flush_memcg_workqueue() ).
-	 */
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
-	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
-unlock:
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-}
-
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
-{
-	int idx;
-	struct memcg_cache_array *arr;
-	struct kmem_cache *s, *c;
-	unsigned int nr_reparented;
-
-	idx = memcg_cache_id(memcg);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-						lockdep_is_held(&slab_mutex));
-		c = arr->entries[idx];
-		if (!c)
-			continue;
-
-		kmemcg_cache_deactivate(c);
-		arr->entries[idx] = NULL;
-	}
-	nr_reparented = 0;
-	list_for_each_entry(s, &memcg->kmem_caches,
-			    memcg_params.kmem_caches_node) {
-		WRITE_ONCE(s->memcg_params.memcg, parent);
-		css_put(&memcg->css);
-		nr_reparented++;
-	}
-	if (nr_reparented) {
-		list_splice_init(&memcg->kmem_caches,
-				 &parent->kmem_caches);
-		css_get_many(&parent->css, nr_reparented);
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
-	struct memcg_cache_array *arr;
-	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
-	int i;
-
 	BUG_ON(!is_root_cache(s));
 
-	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
-	 */
-	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.children_node, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
-
-	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
-	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
-		shutdown_cache(c);
-
-	list_splice(&busy, &s->memcg_params.children);
+	if (s->memcg_params.memcg_cache)
+		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
 
-	/*
-	 * A cache being destroyed must be empty. In particular, this means
-	 * that all per memcg caches attached to it must be empty too.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		return -EBUSY;
 	return 0;
 }
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	s->memcg_params.dying = true;
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-
-	/*
-	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-	 * sure all registered rcu callbacks have been invoked.
-	 */
-	rcu_barrier();
-
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
@@ -919,7 +607,6 @@ static inline void flush_memcg_workqueue(struct kmem_cache *s)
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
-	destroy_memcg_params(s);
 	kfree_const(s->name);
 	kmem_cache_free(kmem_cache, s);
 }
@@ -983,7 +670,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 EXPORT_SYMBOL(kmem_cache_shrink);
 
 /**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * kmem_cache_shrink_all - shrink root and memcg caches
  * @s: The cache pointer
  */
 void kmem_cache_shrink_all(struct kmem_cache *s)
@@ -1000,21 +687,11 @@ void kmem_cache_shrink_all(struct kmem_cache *s)
 	kasan_cache_shrink(s);
 	__kmem_cache_shrink(s);
 
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
+	c = memcg_cache(s);
+	if (c) {
 		kasan_cache_shrink(c);
 		__kmem_cache_shrink(c);
 	}
-	mutex_unlock(&slab_mutex);
 	put_online_mems();
 	put_online_cpus();
 }
@@ -1069,7 +746,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1431,7 +1108,8 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	if (!is_root_cache(s))
 		return;
 
-	for_each_memcg_cache(c, s) {
+	c = memcg_cache(s);
+	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
 		get_slabinfo(c, &sinfo);
 
@@ -1562,7 +1240,7 @@ module_init(slab_proc_init);
 
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
 /*
- * Display information about kmem caches that have child memcg caches.
+ * Display information about kmem caches that have a memcg cache.
  */
 static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 {
@@ -1574,9 +1252,9 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
 	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		/*
-		 * Skip kmem caches that don't have any memcg children.
+		 * Skip kmem caches that don't have the memcg cache.
 		 */
-		if (list_empty(&s->memcg_params.children))
+		if (!s->memcg_params.memcg_cache)
 			continue;
 
 		memset(&sinfo, 0, sizeof(sinfo));
@@ -1585,23 +1263,13 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
 			   sinfo.active_slabs, sinfo.num_slabs);
 
-		for_each_memcg_cache(c, s) {
-			struct cgroup_subsys_state *css;
-			char *status = "";
-
-			css = &c->memcg_params.memcg->css;
-			if (!(css->flags & CSS_ONLINE))
-				status = ":dead";
-			else if (c->flags & SLAB_DEACTIVATED)
-				status = ":deact";
-
-			memset(&sinfo, 0, sizeof(sinfo));
-			get_slabinfo(c, &sinfo);
-			seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n",
-				   cache_name(c), css->id, status,
-				   sinfo.active_objs, sinfo.num_objs,
-				   sinfo.active_slabs, sinfo.num_slabs);
-		}
+		c = s->memcg_params.memcg_cache;
+		memset(&sinfo, 0, sizeof(sinfo));
+		get_slabinfo(c, &sinfo);
+		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
+			   cache_name(c), root_mem_cgroup->css.id,
+			   sinfo.active_objs, sinfo.num_objs,
+			   sinfo.active_slabs, sinfo.num_slabs);
 	}
 	mutex_unlock(&slab_mutex);
 	return 0;
diff --git a/mm/slub.c b/mm/slub.c
index a62545c7acac..53abbf0831b6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4057,36 +4057,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	/*
-	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
-	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
-}
-
-void __kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	/*
-	 * Disable empty slabs caching. Used to avoid pinning offline
-	 * memory cgroups by kmem pages that can be freed.
-	 */
-	slub_set_cpu_partial(s, 0);
-	s->min_partial = 0;
-}
-#endif	/* CONFIG_MEMCG */
-
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4243,7 +4213,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	return s;
 }
 
@@ -4311,7 +4281,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache(c, s) {
+		c = memcg_cache(s);
+		if (c) {
 			c->object_size = s->object_size;
 			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
 		}
@@ -5582,7 +5553,8 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache(c, s)
+		c = memcg_cache(s);
+		if (c)
 			attribute->store(c, buf, len);
 		mutex_unlock(&slab_mutex);
 	}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 15/16] tools/cgroup: make slabinfo.py compatible with new slab controller
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (12 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 14/16] mm: memcg/slab: use one set of kmem_caches for all memory cgroups Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18  0:28 ` [PATCH 16/16] mm: slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

Make slabinfo.py compatible with the new slab controller.

Because there are no more per-memcg kmem_caches, and there is also
no list of all slab pages in the system, it has to walk over all
pages and pick out slab pages belonging to non-root kmem_caches.

Then it counts objects belonging to the given cgroup. It might
sound like a very slow operation, but it's not that slow: it takes
about 30 seconds to walk over 8 GB of slabs out of 64 GB and find
all objects belonging to the cgroup of interest.

Also, it provides an accurate number of active objects, which isn't
true for the old slab controller.

The script is backward compatible and works for both kernel versions.
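
Condensed, the counting logic described above looks roughly like the
sketch below (helper names follow the script in the diff that follows;
this is only an outline, not the patch itself):

def count_memcg_objects(prog, memcg):
    # collect the mem_cgroup_ptr objects belonging to the given cgroup
    memcg_ptrs = {ptr.value_() for ptr in
                  list_for_each_entry('struct mem_cgroup_ptr',
                                      memcg.kmem_memcg_ptr_list.address_of_(),
                                      'list')}
    stats = {}
    # walk all pages and keep slab pages of non-root kmem_caches
    for page in for_each_slab_page(prog):
        cache = page.slab_cache
        if not cache or is_root_cache(cache):
            continue
        vec = page.mem_cgroup_vec
        # count objects whose ownership pointer belongs to the cgroup
        n = sum(1 for i in range(oo_objects(cache))
                if vec[i].value_() in memcg_ptrs)
        stats[cache.value_()] = stats.get(cache.value_(), 0) + n
    return stats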

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/cgroup/slabinfo.py | 105 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 98 insertions(+), 7 deletions(-)

diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
index 40b01a6ec4b0..79909ee36fe3 100755
--- a/tools/cgroup/slabinfo.py
+++ b/tools/cgroup/slabinfo.py
@@ -8,7 +8,10 @@ import argparse
 import sys
 
 from drgn.helpers.linux import list_for_each_entry, list_empty
-from drgn import container_of
+from drgn.helpers.linux import for_each_page
+from drgn.helpers.linux.cpumask import for_each_online_cpu
+from drgn.helpers.linux.percpu import per_cpu_ptr
+from drgn import container_of, FaultError
 
 
 DESC = """
@@ -47,7 +50,7 @@ def is_root_cache(s):
 
 def cache_name(s):
     if is_root_cache(s):
-        return s.name
+        return s.name.string_().decode('utf-8')
     else:
         return s.memcg_params.root_cache.name.string_().decode('utf-8')
 
@@ -99,12 +102,16 @@ def slub_get_slabinfo(s, cfg):
 # SLAB-specific functions can be added here...
 
 
-def cache_show(s, cfg):
+def cache_show(s, cfg, objs):
     if cfg['allocator'] == 'SLUB':
         sinfo = slub_get_slabinfo(s, cfg)
     else:
         err('SLAB isn\'t supported yet')
 
+    if cfg['shared_slab_pages']:
+        sinfo['active_objs'] = objs
+        sinfo['num_objs'] = objs
+
     print('%-17s %6lu %6lu %6u %4u %4d'
           ' : tunables %4u %4u %4u'
           ' : slabdata %6lu %6lu %6lu' % (
@@ -127,9 +134,60 @@ def detect_kernel_config():
     else:
         err('Can\'t determine the slab allocator')
 
+    if prog.type('struct memcg_cache_params').members[1][1] == 'memcg_cache':
+        cfg['shared_slab_pages'] = True
+    else:
+        cfg['shared_slab_pages'] = False
+
     return cfg
 
 
+def for_each_slab_page(prog):
+    PGSlab = 1 << prog.constant('PG_slab')
+    PGHead = 1 << prog.constant('PG_head')
+
+    for page in for_each_page(prog):
+        try:
+            if page.flags.value_() & PGSlab:
+                yield page
+        except FaultError:
+            pass
+
+
+# it doesn't find all objects, because by default full slabs are not
+# placed on the full list. However, it can be used in certain cases.
+def for_each_slab_page_fast(prog):
+    for s in list_for_each_entry('struct kmem_cache',
+                                 prog['slab_caches'].address_of_(),
+                                 'list'):
+        if is_root_cache(s):
+            continue
+
+        if s.cpu_partial:
+            for cpu in for_each_online_cpu(prog):
+                cpu_slab = per_cpu_ptr(s.cpu_slab, cpu)
+                if cpu_slab.page:
+                    yield cpu_slab.page
+
+                page = cpu_slab.partial
+                while page:
+                    yield page
+                    page = page.next
+
+        for node in range(prog['nr_online_nodes'].value_()):
+            n = s.node[node]
+
+            for page in list_for_each_entry('struct page',
+                                            n.partial.address_of_(),
+                                            'slab_list'):
+                yield page
+
+            for page in list_for_each_entry('struct page',
+                                            n.full.address_of_(),
+                                            'slab_list'):
+                yield page
+
+
 def main():
     parser = argparse.ArgumentParser(description=DESC,
                                      formatter_class=argparse.RawTextHelpFormatter)
@@ -150,10 +208,43 @@ def main():
           ' : tunables <limit> <batchcount> <sharedfactor>'
           ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
 
-    for s in list_for_each_entry('struct kmem_cache',
-                                 memcg.kmem_caches.address_of_(),
-                                 'memcg_params.kmem_caches_node'):
-        cache_show(s, cfg)
+    if cfg['shared_slab_pages']:
+        memcg_ptrs = set()
+        stats = {}
+        caches = {}
+
+        # find memcg pointers belonging to the specified cgroup
+        for ptr in list_for_each_entry('struct mem_cgroup_ptr',
+                                       memcg.kmem_memcg_ptr_list.address_of_(),
+                                       'list'):
+            memcg_ptrs.add(ptr.value_())
+
+        # look over all slab pages, belonging to non-root memcgs
+        # and look for objects belonging to the given memory cgroup
+        for page in for_each_slab_page(prog):
+            cache = page.slab_cache
+            if not cache or is_root_cache(cache):
+                continue
+            addr = cache.value_()
+            caches[addr] = cache
+            memcg_vec = page.mem_cgroup_vec
+
+            if addr not in stats:
+                stats[addr] = 0
+
+            for i in range(oo_objects(cache)):
+                if memcg_vec[i].value_() in memcg_ptrs:
+                    stats[addr] += 1
+
+        for addr in caches:
+            if stats[addr] > 0:
+                cache_show(caches[addr], cfg, stats[addr])
+
+    else:
+        for s in list_for_each_entry('struct kmem_cache',
+                                     memcg.kmem_caches.address_of_(),
+                                     'memcg_params.kmem_caches_node'):
+            cache_show(s, cfg, None)
 
 
 main()
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 16/16] mm: slab: remove redundant check in memcg_accumulate_slabinfo()
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (13 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 15/16] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
@ 2019-10-18  0:28 ` Roman Gushchin
  2019-10-18 17:03 ` [PATCH 00/16] The new slab memory controller Waiman Long
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

memcg_accumulate_slabinfo() is never called with a non-root
kmem_cache as a first argument, so the is_root_cache(s) check
is redundant and can be removed without any functional change.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab_common.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3dcd90ba7525..1c10c94343f6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1105,9 +1105,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	struct kmem_cache *c;
 	struct slabinfo sinfo;
 
-	if (!is_root_cache(s))
-		return;
-
 	c = memcg_cache(s);
 	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (14 preceding siblings ...)
  2019-10-18  0:28 ` [PATCH 16/16] mm: slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2019-10-18 17:03 ` Waiman Long
  2019-10-18 17:12   ` Roman Gushchin
  2019-10-22 13:22 ` Michal Hocko
  2019-10-22 13:31 ` Michal Hocko
  17 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2019-10-18 17:03 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Christoph Lameter

On 10/17/19 8:28 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
                                               ^it^
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.

In the case of the SLUB memory allocator, it is not just unused space within
a slab. It is also the use of per-cpu slabs that can hold up a lot of
memory, especially if the tasks jump around to different cpus. The
problem is compounded if a lot of memcgs are being used. Memory
utilization can improve quite significantly if per-cpu slabs are
disabled. Of course, it comes with a performance cost.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18 17:03 ` [PATCH 00/16] The new slab memory controller Waiman Long
@ 2019-10-18 17:12   ` Roman Gushchin
  0 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-18 17:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Christoph Lameter

On Fri, Oct 18, 2019 at 01:03:54PM -0400, Waiman Long wrote:
> On 10/17/19 8:28 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
>                                                ^it^
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> 
> In the case of slub memory allocator, it is not just unused space within
> a slab. It is also the use of per-cpu slabs that can hold up a lot of
> memory, especially if the tasks jump around to different cpus. The
> problem is compounded if a lot of memcgs are being used. Memory
> utilization can improve quite significantly if per-cpu slabs are
> disabled. Of course, it comes with a performance cost.

Right, but it's basically the same problem: if slabs can be used exclusively
by a single memory cgroup, slab utilization is low. Per-cpu slabs are just
making the problem worse by increasing the number of mostly empty slabs
proportionally to the number of CPUs.

With memory cgroup accounting disabled, slab utilization is quite high
even with per-cpu slabs. So the problem isn't in per-cpu slabs by themselves;
they just were not designed to exist in so many copies.
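
To put rough numbers on that scaling (all values below are made up,
purely for illustration):

# hypothetical example of how per-memcg per-cpu slabs multiply waste
memcgs = 100                 # cgroups with kmem accounting enabled
caches = 50                  # kmem_caches each cgroup actively uses
cpus = 64
wasted_per_slab = 3 * 1024   # ~3 KB unused in a mostly empty 4 KB slab
print(memcgs * caches * cpus * wasted_per_slab / 2**30)   # ~0.9 GB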

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-18  0:28 ` [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
@ 2019-10-20 22:44   ` Christopher Lameter
  2019-10-21  1:15     ` Roman Gushchin
  2019-10-20 22:51   ` Christopher Lameter
  1 sibling, 1 reply; 36+ messages in thread
From: Christopher Lameter @ 2019-10-20 22:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	kernel-team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Thu, 17 Oct 2019, Roman Gushchin wrote:

> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

This quadruples the cache footprint of the counters and likely has some
influence on performance.
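
For scale, a back-of-the-envelope estimate (the per-node item count is
an assumption here, not the exact NR_VM_NODE_STAT_ITEMS value):

# hypothetical example: per-cpu node stat diff array size
items, nodes, cpus = 40, 2, 64
print(items * 1 * nodes * cpus)   # s8 diffs:   5120 bytes
print(items * 4 * nodes * cpus)   # s32 diffs: 20480 bytes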

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-18  0:28 ` [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
  2019-10-20 22:44   ` Christopher Lameter
@ 2019-10-20 22:51   ` Christopher Lameter
  2019-10-21  1:21     ` Roman Gushchin
  1 sibling, 1 reply; 36+ messages in thread
From: Christopher Lameter @ 2019-10-20 22:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	kernel-team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Thu, 17 Oct 2019, Roman Gushchin wrote:

> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
>
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

Actually this is inconsistent, since the other counters are all used to
account for pages. Some of the functions used for the page counters will
no longer make sense: inc/dec_node_state() etc.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-20 22:44   ` Christopher Lameter
@ 2019-10-21  1:15     ` Roman Gushchin
  2019-10-21 18:09       ` Christopher Lameter
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-10-21  1:15 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Sun, Oct 20, 2019 at 10:44:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
> 
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> 
> This quardruples the cache footprint of the counters and likely has some
> influence on performance.

But otherwise slab allocation counters will be flushed too often, which
will likely be noticeable. I can do custom s32 buffers only for these counters,
but to me it seems like an unnecessary complication, unless we find
a clear regression.

So far I haven't noticed any regression on the set of workloads where I tested
the patchset, but if you know of any benchmark or realistic test which can be
affected by this change, I'll be happy to try it.

Also, less-than-word-sized operations can be slower on some platforms.
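
A quick illustration of the overflow argument (example numbers only):

# hypothetical example: why byte-sized deltas don't fit into s8
S8_MAX = 127
threshold_pages = 125                    # overfill threshold upper bound
threshold_bytes = threshold_pages * 4096
obj_size = 512                           # one 512-byte slab object
# a single allocation already exceeds s8, yet is far below the
# byte-converted threshold
print(obj_size > S8_MAX, obj_size > threshold_bytes)   # True False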

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-20 22:51   ` Christopher Lameter
@ 2019-10-21  1:21     ` Roman Gushchin
  0 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-21  1:21 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Sun, Oct 20, 2019 at 10:51:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
> 
> > Currently s8 type is used for per-cpu caching of per-node statistics.
> > It works fine because the overfill threshold can't exceed 125.
> >
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> 
> Actually this is inconsistenct since the other counters are all used to
> account for pages. Some of the functions in use for the page counters will
> no longer make sense. inc/dec_node_state() etc.

Actually I tried to implement what Johannes suggested earlier and convert
all counters to bytes, but it looked like it could lead to a significant
performance hit on 32-bit platforms, as I'd need to replace all atomic_long_t
counters with atomic64_t. Also, to make the code look good, I'd need to convert
all counters to bytes (and atomic64_t): zone stats, vmevents, etc.
So I gave up relatively quickly.

Maybe it's a good long-term plan, but as of now it doesn't really look
like an obviously good thing to do.

I'm fully open to any suggestions here.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-10-21  1:15     ` Roman Gushchin
@ 2019-10-21 18:09       ` Christopher Lameter
  0 siblings, 0 replies; 36+ messages in thread
From: Christopher Lameter @ 2019-10-21 18:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Mon, 21 Oct 2019, Roman Gushchin wrote:

> So far I haven't noticed any regression on the set of workloads where I tested
> the patchset, but if you know of any benchmark or realistic test which can be
> affected by this change, I'll be happy to try it.
>
> Also, less-than-word-sized operations can be slower on some platforms.

The main issue in the past was the cache footprint. Memory is slow.
Processors are fast.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (15 preceding siblings ...)
  2019-10-18 17:03 ` [PATCH 00/16] The new slab memory controller Waiman Long
@ 2019-10-22 13:22 ` Michal Hocko
  2019-10-22 13:28   ` Michal Hocko
  2019-10-22 13:31 ` Michal Hocko
  17 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2019-10-22 13:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
[...]
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.

How much of the memory are we talking about here? Also is there any
pattern for specific caches that tend to utilize much worse than others?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:22 ` Michal Hocko
@ 2019-10-22 13:28   ` Michal Hocko
  2019-10-22 15:48     ` Roman Gushchin
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2019-10-22 13:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> [...]
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> 
> How much of the memory are we talking about here?

Just to be more specific: your cover letter mentions several hundreds of MBs,
but there is no scale for the overall charged memory. How much of that is
the actual kmem-accounted memory?

> Also is there any pattern for specific caches that tend to utilize
> much worse than others?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
                   ` (16 preceding siblings ...)
  2019-10-22 13:22 ` Michal Hocko
@ 2019-10-22 13:31 ` Michal Hocko
  2019-10-22 15:59   ` Roman Gushchin
  17 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2019-10-22 13:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.

What is the performance impact? Also what is the effect on the memory
reclaim side and the isolation. I would expect that mixing objects from
different cgroups would have a negative/unpredictable impact on the
memcg slab shrinking.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:28   ` Michal Hocko
@ 2019-10-22 15:48     ` Roman Gushchin
  0 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-22 15:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, linux-kernel, Kernel Team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue, Oct 22, 2019 at 03:28:00PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> > On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > [...]
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > 
> > How much of the memory are we talking about here?
> 
> Just to be more specific. Your cover mentions several hundreds of MBs
> but there is no scale to the overal charged memory. How much of that is
> the actual kmem accounted memory.

As I wrote, on average it saves 30-45% of slab memory.
The smallest number I've seen was about 15%, the largest over 60%.

The amount of slab memory isn't a very stable metric in general: it heavily
depends on the workload pattern, memory pressure, uptime, etc.
In absolute numbers I've seen savings from ~60 MB for an empty VM to
more than 2 GB for some production workloads.

Btw, please note that after a recent change from Vlastimil,
6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting"),
slab counters now include large allocations which are passed
directly to the page allocator. It makes the memory savings
smaller in percentage terms, but of course not in absolute numbers.
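
To illustrate the percentage-vs-absolute point with made-up numbers:

# hypothetical example (all numbers are invented)
slab_proper = 1024      # MB of "classic" slab memory
savings = 400           # MB saved by the new controller
large_allocs = 1024     # MB of large allocations now counted as slab too
print(savings / slab_proper)                   # ~0.39 of slab proper
print(savings / (slab_proper + large_allocs))  # ~0.20 of the larger total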

> 
> > Also is there any pattern for specific caches that tend to utilize
> > much worse than others?

Caches which usually have many objects (e.g. inodes) initially
have better utilization, but as some of the objects get reclaimed,
the utilization drops. And if the cgroup is already dead, no one can
reuse these mostly empty slab pages, so it's pretty wasteful.

So I don't think the problem is specific to any cache; it's pretty general.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:31 ` Michal Hocko
@ 2019-10-22 15:59   ` Roman Gushchin
  0 siblings, 0 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-22 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, linux-kernel, Kernel Team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue, Oct 22, 2019 at 03:31:48PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> > 
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> > 
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> > 
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
> 
> What is the performance impact?

As I wrote, so far we haven't found any regression on any real-world workload.
Of course, it's pretty easy to come up with a synthetic test which will show
some performance hit: e.g. allocate and free a large number of objects from a
single cache in a single cgroup. The reason is simple: stats and accounting
are more precise, so they require more work. But I don't think it's a real
problem.

On the other hand, I expect to see some positive effects from the significantly
reduced number of unmovable pages: memory fragmentation should become lower.
And all kernel objects will reside on a smaller number of pages, so we can
expect better cache utilization.

> Also what is the effect on the memory
> reclaim side and the isolation. I would expect that mixing objects from
> different cgroups would have a negative/unpredictable impact on the
> memcg slab shrinking.

Slab shrinking already works on a per-object basis, so no changes here.

Quite the opposite: now the freed space can be reused by other cgroups, whereas
previously shrinking was often a useless operation, as nobody could reuse the
space unless all objects were freed and the page could be returned to the page
allocator.

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-18  0:28 ` [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2019-10-25 19:41   ` Johannes Weiner
  2019-10-25 20:00     ` Roman Gushchin
  2019-10-31  1:52     ` Roman Gushchin
  0 siblings, 2 replies; 36+ messages in thread
From: Johannes Weiner @ 2019-10-25 19:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, kernel-team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu, Oct 17, 2019 at 05:28:13PM -0700, Roman Gushchin wrote:
> +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> +						struct mem_cgroup **memcgp,
> +						size_t size, gfp_t flags)
> +{
> +	struct kmem_cache *cachep;
> +
> +	cachep = memcg_kmem_get_cache(s, memcgp);
> +	if (is_root_cache(cachep))
> +		return s;
> +
> +	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
> +		mem_cgroup_put(*memcgp);
> +		memcg_kmem_put_cache(cachep);
> +		cachep = NULL;
> +	}
> +
> +	return cachep;
> +}
> +
>  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>  					      struct mem_cgroup *memcg,
>  					      size_t size, void **p)
>  {
>  	struct mem_cgroup_ptr *memcg_ptr;
> +	struct lruvec *lruvec;
>  	struct page *page;
>  	unsigned long off;
>  	size_t i;
> @@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>  			off = obj_to_index(s, page, p[i]);
>  			mem_cgroup_ptr_get(memcg_ptr);
>  			page->mem_cgroup_vec[off] = memcg_ptr;
> +			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
> +					       s->size);
> +		} else {
> +			__memcg_kmem_uncharge_subpage(memcg, s->size);
>  		}
>  	}
>  	mem_cgroup_ptr_put(memcg_ptr);

The memcg_ptr as a collection vessel for object references makes a lot
of sense. But this code showcases that it should be a first-class
memory tracking API that the allocator interacts with, rather than
having to deal with a combination of memcg_ptr and memcg.

In the two hunks here, on one hand we charge bytes to the memcg
object, and then handle all the refcounting through a different
bucketing object. To support that in the first place, we have to
overload the memcg API all the way down to try_charge() to support
bytes and pages. This is difficult to follow throughout all layers.

What would be better is for this to be an abstraction layer for a
subpage object tracker that sits on top of the memcg page tracker -
not unlike the page allocator and the slab allocators themselves.

And then the slab allocator would only interact with the subpage
object tracker, and the object tracker would deal with the memcg page
tracker under the hood.

Something like the below, just a rough sketch to convey the idea:

On allocation, we look up an obj_cgroup from the current task, charge
it with obj_cgroup_charge() and bump its refcount by the number of
objects we allocated. obj_cgroup_charge() in turn uses the memcg layer
to charge pages, and then doles them out as bytes using
objcg->nr_stocked_bytes and the byte per-cpu stock.

On freeing, we call obj_cgroup_uncharge() and drop the references. In
turn, this first refills the byte per-cpu stock and drains the
overflow to the memcg layer (which in turn refills its own page stock
and drains to the page_counter).

On cgroup deletion, we reassign the obj_cgroup to the parent, and all
remaining live objects will free their pages to the updated parental
obj_cgroup->memcg, as per usual.

---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f36203cf75f8..efc3398aaf02 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,15 +200,15 @@ struct memcg_cgwb_frn {
 };
 
 /*
- * A pointer to a memory cgroup with a built-in reference counter.
- * For a use as an intermediate object to simplify reparenting of
- * objects charged to the cgroup. The memcg pointer can be switched
- * to the parent cgroup without a need to modify all objects
- * which hold the reference to the cgroup.
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
  */
-struct mem_cgroup_ptr {
-	struct percpu_ref refcnt;
+struct obj_cgroup {
+	struct percpu_ref count;
 	struct mem_cgroup *memcg;
+	atomic_t nr_stocked_bytes;
 	union {
 		struct list_head list;
 		struct rcu_head rcu;
@@ -230,7 +230,6 @@ struct mem_cgroup {
 	/* Accounted resources */
 	struct page_counter memory;
 	struct page_counter swap;
-	atomic_t nr_stocked_bytes;
 
 	/* Legacy consumer-oriented counters */
 	struct page_counter memsw;
@@ -327,8 +326,8 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
-	struct list_head kmem_memcg_ptr_list;
+	struct obj_cgroup __rcu *slab_objs;
+	struct list_head slab_objs_list;
 #endif
 
 	int last_scanned_node;
@@ -474,21 +473,6 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
-static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
-{
-	return percpu_ref_tryget(&ptr->refcnt);
-}
-
-static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
-{
-	percpu_ref_get(&ptr->refcnt);
-}
-
-static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
-{
-	percpu_ref_put(&ptr->refcnt);
-}
-
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1444,9 +1428,9 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 			      struct mem_cgroup *memcg);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
-int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
+int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size,
 				gfp_t gfp);
-void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1513,20 +1497,20 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
-static inline struct mem_cgroup_ptr *
-mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
+static inline struct obj_cgroup *
+mem_cgroup_get_slab_objs(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_ptr *memcg_ptr;
+	struct obj_cgroup *objcg;
 
 	rcu_read_lock();
 	do {
-		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
-		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
+		objcg = rcu_dereference(memcg->slab_objs);
+		if (objcg && percpu_ref_tryget(&objcg->count))
 			break;
 	} while ((memcg = parent_mem_cgroup(memcg)));
 	rcu_read_unlock();
 
-	return memcg_ptr;
+	return objcg;
 }
 
 #else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4d99ee5a9c53..7f663fd53c17 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -200,7 +200,7 @@ struct page {
 #ifdef CONFIG_MEMCG
 	union {
 		struct mem_cgroup *mem_cgroup;
-		struct mem_cgroup_ptr **mem_cgroup_vec;
+		struct obj_cgroup **obj_cgroups;
 	};
 #endif
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b0d0c833150c..47110f4f0f4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -260,52 +260,51 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 #ifdef CONFIG_MEMCG_KMEM
 extern spinlock_t css_set_lock;
 
-static void memcg_ptr_release(struct percpu_ref *ref)
+static void obj_cgroup_release(struct percpu_ref *ref)
 {
-	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
-						  refcnt);
+	struct obj_cgroup *objcg;
 	unsigned long flags;
 
+	objcg = container_of(ref, struct obj_cgroup, count);
+
 	spin_lock_irqsave(&css_set_lock, flags);
-	list_del(&ptr->list);
+	list_del(&objcg->list);
 	spin_unlock_irqrestore(&css_set_lock, flags);
 
-	mem_cgroup_put(ptr->memcg);
+	mem_cgroup_put(objcg->memcg);
 	percpu_ref_exit(ref);
-	kfree_rcu(ptr, rcu);
+	kfree_rcu(objcg, rcu);
 }
 
-static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
+static int obj_cgroup_alloc(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_ptr *kmem_memcg_ptr;
+	struct obj_cgroup *objcg;
 	int ret;
 
-	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
-	if (!kmem_memcg_ptr)
+	objcg = kmalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
 		return -ENOMEM;
 
-	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
+	ret = percpu_ref_init(&objcg->count, obj_cgroup_release,
 			      0, GFP_KERNEL);
 	if (ret) {
-		kfree(kmem_memcg_ptr);
+		kfree(objcg);
 		return ret;
 	}
 
-	kmem_memcg_ptr->memcg = memcg;
-	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
-	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
-	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
+	INIT_LIST_HEAD(&objcg->list);
+	objcg->memcg = memcg;
 	return 0;
 }
 
-static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
-					  struct mem_cgroup *parent)
+static void slab_objs_reparent(struct mem_cgroup *memcg,
+			       struct mem_cgroup *parent)
 {
+	struct obj_cgroup *objcg = NULL;
 	unsigned int nr_reparented = 0;
-	struct mem_cgroup_ptr *memcg_ptr = NULL;
 
-	rcu_swap_protected(memcg->kmem_memcg_ptr, memcg_ptr, true);
-	percpu_ref_kill(&memcg_ptr->refcnt);
+	rcu_swap_protected(memcg->slab_objs, objcg, true);
+	percpu_ref_kill(&objcg->count);
 
 	/*
 	 * kmem_memcg_ptr is initialized before css refcounter, so until now
@@ -314,13 +313,13 @@ static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
 	css_get(&memcg->css);
 
 	spin_lock_irq(&css_set_lock);
-	list_for_each_entry(memcg_ptr, &memcg->kmem_memcg_ptr_list, list) {
-		xchg(&memcg_ptr->memcg, parent);
+	list_for_each_entry(objcg, &memcg->slab_objs_list, list) {
+		xchg(&objcg->memcg, parent);
 		nr_reparented++;
 	}
 	if (nr_reparented)
-		list_splice(&memcg->kmem_memcg_ptr_list,
-			    &parent->kmem_memcg_ptr_list);
+		list_splice(&memcg->slab_objs_list,
+			    &parent->slab_objs_list);
 	spin_unlock_irq(&css_set_lock);
 
 	if (nr_reparented) {
@@ -2217,7 +2216,7 @@ struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 
-	struct mem_cgroup *subpage_cached;
+	struct obj_cgroup *obj_cached;
 	unsigned int nr_bytes;
 
 	struct work_struct work;
@@ -2260,29 +2259,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	return ret;
 }
 
-static bool consume_subpage_stock(struct mem_cgroup *memcg,
-				  unsigned int nr_bytes)
-{
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-	bool ret = false;
-
-	if (nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
-		return ret;
-
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
-	if (memcg == stock->subpage_cached && stock->nr_bytes >= nr_bytes) {
-		stock->nr_bytes -= nr_bytes;
-		ret = true;
-	}
-
-	local_irq_restore(flags);
-
-	return ret;
-}
-
 /*
  * Returns stocks cached in percpu and reset cached information.
  */
@@ -2300,59 +2276,73 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 	stock->cached = NULL;
 }
 
-static void drain_subpage_stock(struct memcg_stock_pcp *stock)
+/*
+ * Cache charges(val) to local per_cpu area.
+ * This will be consumed by consume_stock() function, later.
+ */
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	struct mem_cgroup *old = stock->subpage_cached;
-
-	if (stock->nr_bytes) {
-		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
-		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
 
-		page_counter_uncharge(&old->memory, nr_pages);
-		if (do_memsw_account())
-			page_counter_uncharge(&old->memsw, nr_pages);
+	local_irq_save(flags);
 
-		atomic_add(nr_bytes, &old->nr_stocked_bytes);
-		stock->nr_bytes = 0;
-	}
-	if (stock->subpage_cached) {
-		/*
-		 * Unlike generic percpu stock, where ->cached pointer is
-		 * protected by css references of held pages, pages held
-		 * by subpage stocks are not holding a css reference, so
-		 * we need to protect ->subpage_cached pointer explicitly.
-		 *
-		 * css_put() is paired with css_get() in refill_subpage_stock()
-		 */
-		css_put(&old->css);
-		stock->subpage_cached = NULL;
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached != memcg) { /* reset if necessary */
+		drain_stock(stock);
+		stock->cached = memcg;
 	}
+	stock->nr_pages += nr_pages;
+
+	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
+		drain_stock(stock);
+
+	local_irq_restore(flags);
 }
 
-static void drain_local_stock(struct work_struct *dummy)
+static bool consume_obj_stock(struct obj_cgroup *objcg,
+				  unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
+	bool ret = false;
+
+	if (nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
+		return ret;
 
-	/*
-	 * The only protection from memory hotplug vs. drain_stock races is
-	 * that we always operate on local CPU stock here with IRQ disabled
-	 */
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_stock(stock);
-	drain_subpage_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	if (objcg == stock->obj_cached && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
 
 	local_irq_restore(flags);
+
+	return ret;
 }
 
-/*
- * Cache charges(val) to local per_cpu area.
- * This will be consumed by consume_stock() function, later.
- */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->obj_cached;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		refill_stock(old->memcg, nr_pages);
+		atomic_add(nr_bytes, &old->nr_stocked_bytes);
+		stock->nr_bytes = 0;
+	}
+	if (stock->obj_cached) {
+		percpu_ref_put(&old->count);
+		stock->obj_cached = NULL;
+	}
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg,
+				 unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
@@ -2360,37 +2350,35 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	if (stock->cached != memcg) { /* reset if necessary */
-		drain_stock(stock);
-		stock->cached = memcg;
+	if (stock->obj_cached != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		percpu_ref_get(&objcg->count);
+		stock->obj_cached = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_stocked_bytes, 0);
 	}
-	stock->nr_pages += nr_pages;
+	stock->nr_bytes += nr_bytes;
 
-	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
-		drain_stock(stock);
+	if (stock->nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
+		drain_obj_stock(stock);
 
 	local_irq_restore(flags);
 }
 
-static void refill_subpage_stock(struct mem_cgroup *memcg,
-				 unsigned int nr_bytes)
+static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
 
+	/*
+	 * The only protection from memory hotplug vs. drain_stock races is
+	 * that we always operate on local CPU stock here with IRQ disabled
+	 */
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	if (stock->subpage_cached != memcg) { /* reset if necessary */
-		drain_subpage_stock(stock);
-		css_get(&memcg->css);
-		stock->subpage_cached = memcg;
-		stock->nr_bytes = atomic_xchg(&memcg->nr_stocked_bytes, 0);
-	}
-	stock->nr_bytes += nr_bytes;
-
-	if (stock->nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
-		drain_subpage_stock(stock);
+	drain_obj_stock(stock);
+	drain_stock(stock);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
 	local_irq_restore(flags);
 }
@@ -2658,9 +2646,8 @@ void mem_cgroup_handle_over_high(void)
  */
 
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-		      unsigned int amount, bool subpage)
+		      unsigned int nr_pages)
 {
-	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
@@ -2673,9 +2660,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 retry:
-	if (subpage && consume_subpage_stock(memcg, amount))
-		return 0;
-	else if (!subpage && consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages))
 		return 0;
 
 	if (!do_memsw_account() ||
@@ -2794,21 +2779,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 
-	if (subpage)
-		refill_subpage_stock(memcg, (nr_pages << PAGE_SHIFT) - amount);
-	else
-		css_get_many(&memcg->css, nr_pages);
+	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	if (subpage && (batch << PAGE_SHIFT) > amount) {
-		refill_subpage_stock(memcg, (batch << PAGE_SHIFT) - amount);
-	} else if (!subpage) {
-		css_get_many(&memcg->css, batch);
-		if (batch > nr_pages)
-			refill_stock(memcg, batch - nr_pages);
-	}
+	css_get_many(&memcg->css, batch);
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
 
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
@@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
-				gfp_t gfp)
+int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
 {
-	return try_charge(memcg, gfp, size, true);
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	ret = try_charge(objcg->memcg, gfp, 1);
+	if (ret)
+		return ret;
+
+	refill_obj_stock(objcg, PAGE_SIZE - size);
+	return 0;
 }
 
-void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size)
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-	refill_subpage_stock(memcg, size);
+	refill_obj_stock(objcg, size);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
@@ -3562,6 +3549,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
+	struct obj_cgroup *objcg;
 	int memcg_id, ret;
 
 	if (cgroup_memory_nokmem)
@@ -3574,11 +3562,13 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
-	ret = memcg_init_kmem_memcg_ptr(memcg);
-	if (ret) {
+	objcg = obj_cgroup_alloc(memcg);
+	if (!objcg) {
 		memcg_free_cache_id(memcg_id);
-		return ret;
+		return -ENOMEM;
 	}
+	rcu_assign_pointer(memcg->slab_objs, objcg);
+	list_add(&objcg->list, &memcg->slab_objs_list);
 
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
@@ -3613,7 +3603,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	 * precise values at the parent and all ancestor levels. It's required
 	 * to keep slab stats accurate after the reparenting of kmem_caches.
 	 */
-	memcg_reparent_kmem_memcg_ptr(memcg, parent);
+	obj_cgroup_reparent(memcg, parent);
 	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -5164,7 +5154,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
-	INIT_LIST_HEAD(&memcg->kmem_memcg_ptr_list);
+	INIT_LIST_HEAD(&memcg->slab_objs_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
diff --git a/mm/slab.c b/mm/slab.c
index 0914d7cd869f..d452e0e7aab1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3212,10 +3212,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
-	struct mem_cgroup *memcg = NULL;
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3251,7 +3251,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, memcg, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3292,10 +3292,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
-	struct mem_cgroup *memcg = NULL;
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3309,7 +3309,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, memcg, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3497,9 +3497,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
-	struct mem_cgroup *memcg = NULL;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, &memcg, size, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3522,13 +3522,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, memcg, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, memcg, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index 035c2969a2ca..9dc082c8140d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -301,29 +301,30 @@ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 	return memcg_ptr->memcg;
 }
 
-static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
 					     unsigned int objects)
 {
-	page->mem_cgroup_vec = kmalloc(sizeof(struct mem_cgroup_ptr *) *
+	page->obj_cgroups = kmalloc(sizeof(struct obj_cgroup *) *
 				       objects, gfp | __GFP_ZERO);
-	if (!page->mem_cgroup_vec)
+	if (!page->obj_cgroups)
 		return -ENOMEM;
 
 	return 0;
 }
 
-static inline void memcg_free_page_memcg_vec(struct page *page)
+static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
-	kfree(page->mem_cgroup_vec);
-	page->mem_cgroup_vec = NULL;
+	kfree(page->obj_cgroups);
+	page->obj_cgroups = NULL;
 }
 
 static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct mem_cgroup **memcgp,
+						struct obj_cgroup **objcgp,
 						size_t size, gfp_t flags)
 {
 	struct kmem_cache *cachep;
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 
 	if (memcg_kmem_bypass())
 		return s;
@@ -338,48 +339,51 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 		return s;
 	}
 
-	if (__memcg_kmem_charge_subpage(memcg, size * s->size, flags)) {
+	objcg = mem_cgroup_get_slab_objs(memcg);
+	if (obj_cgroup_charge(objcg, size * s->size, flags)) {
+		percpu_ref_put(&objcg->count);
 		mem_cgroup_put(memcg);
 		return NULL;
 	}
-
-	*memcgp = memcg;
+	*objcgp = objcg;
 	return cachep;
 }
 
-static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
-					      struct mem_cgroup *memcg,
-					      size_t size, void **p)
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					       struct obj_cgroup *objcg,
+					       size_t size, void **p)
 {
-	struct mem_cgroup_ptr *memcg_ptr;
 	struct lruvec *lruvec;
 	struct page *page;
 	unsigned long off;
 	size_t i;
 
-	memcg_ptr = mem_cgroup_get_kmem_ptr(memcg);
 	for (i = 0; i < size; i++) {
 		if (likely(p[i])) {
 			page = virt_to_head_page(p[i]);
 			off = obj_to_index(s, page, p[i]);
-			mem_cgroup_ptr_get(memcg_ptr);
-			page->mem_cgroup_vec[off] = memcg_ptr;
-			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+			page->obj_cgroups[off] = objcg;
+			lruvec = mem_cgroup_lruvec(page_pgdat(page),
+						   objcg->memcg);
 			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
 					       s->size);
 		} else {
-			__memcg_kmem_uncharge_subpage(memcg, s->size);
+			obj_cgroup_uncharge(objcg, s->size);
+			percpu_ref_put(&objcg->count);
 		}
 	}
-	mem_cgroup_ptr_put(memcg_ptr);
+
+	/* pre_hook references */
+	percpu_ref_put(&objcg->count);
-	mem_cgroup_put(memcg);
+	mem_cgroup_put(objcg->memcg);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 					void *p)
 {
-	struct mem_cgroup_ptr *memcg_ptr;
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	struct lruvec *lruvec;
 	unsigned int off;
 
@@ -387,17 +391,16 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 		return;
 
 	off = obj_to_index(s, page, p);
-	memcg_ptr = page->mem_cgroup_vec[off];
-	page->mem_cgroup_vec[off] = NULL;
+	objcg = page->obj_cgroups[off];
+	page->obj_cgroups[off] = NULL;
+
 	rcu_read_lock();
-	memcg = memcg_ptr->memcg;
-	if (likely(!mem_cgroup_is_root(memcg))) {
-		__memcg_kmem_uncharge_subpage(memcg, s->size);
-		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-		mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s), -s->size);
-	}
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), objcg->memcg);
+	mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s), -s->size);
 	rcu_read_unlock();
-	mem_cgroup_ptr_put(memcg_ptr);
+
+	obj_cgroup_uncharge(objcg, s->size);
+	percpu_ref_put(&objcg->count);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
@@ -451,14 +454,14 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 }
 
 static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
-						struct mem_cgroup **memcgp,
+						struct obj_cgroup **objcgp,
 						size_t size, gfp_t flags)
 {
 	return NULL;
 }
 
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
-					      struct mem_cgroup *memcg,
+					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
 {
 }
@@ -570,7 +573,7 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     struct mem_cgroup **memcgp,
+						     struct obj_cgroup **objcg,
 						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
@@ -585,13 +588,13 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_slab_pre_alloc_hook(s, memcgp, size, flags);
+		return memcg_slab_pre_alloc_hook(s, objcg, size, flags);
 
 	return s;
 }
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s,
-					struct mem_cgroup *memcg,
+					struct obj_cgroup *objcg,
 					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
@@ -605,7 +608,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 	}
 
 	if (!is_root_cache(s))
-		memcg_slab_post_alloc_hook(s, memcg, size, p);
+		memcg_slab_post_alloc_hook(s, objcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index 53abbf0831b6..8f7f035d49a3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3162,10 +3162,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
-	struct mem_cgroup *memcg = NULL;
+	struct obj_cgroup *objcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, &memcg, size, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3210,11 +3210,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, memcg, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, memcg, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-25 19:41   ` Johannes Weiner
@ 2019-10-25 20:00     ` Roman Gushchin
  2019-10-25 20:52       ` Johannes Weiner
  2019-10-31  1:52     ` Roman Gushchin
  1 sibling, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-10-25 20:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> On Thu, Oct 17, 2019 at 05:28:13PM -0700, Roman Gushchin wrote:
> > +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> > +						struct mem_cgroup **memcgp,
> > +						size_t size, gfp_t flags)
> > +{
> > +	struct kmem_cache *cachep;
> > +
> > +	cachep = memcg_kmem_get_cache(s, memcgp);
> > +	if (is_root_cache(cachep))
> > +		return s;
> > +
> > +	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
> > +		mem_cgroup_put(*memcgp);
> > +		memcg_kmem_put_cache(cachep);
> > +		cachep = NULL;
> > +	}
> > +
> > +	return cachep;
> > +}
> > +
> >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> >  					      struct mem_cgroup *memcg,
> >  					      size_t size, void **p)
> >  {
> >  	struct mem_cgroup_ptr *memcg_ptr;
> > +	struct lruvec *lruvec;
> >  	struct page *page;
> >  	unsigned long off;
> >  	size_t i;
> > @@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> >  			off = obj_to_index(s, page, p[i]);
> >  			mem_cgroup_ptr_get(memcg_ptr);
> >  			page->mem_cgroup_vec[off] = memcg_ptr;
> > +			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> > +			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
> > +					       s->size);
> > +		} else {
> > +			__memcg_kmem_uncharge_subpage(memcg, s->size);
> >  		}
> >  	}
> >  	mem_cgroup_ptr_put(memcg_ptr);
> 
> The memcg_ptr as a collection vessel for object references makes a lot
> of sense. But this code showcases that it should be a first-class
> memory tracking API that the allocator interacts with, rather than
> having to deal with a combination of memcg_ptr and memcg.
> 
> In the two hunks here, on one hand we charge bytes to the memcg
> object, and then handle all the refcounting through a different
> bucketing object. To support that in the first place, we have to
> overload the memcg API all the way down to try_charge() to support
> bytes and pages. This is difficult to follow throughout all layers.
> 
> What would be better is for this to be an abstraction layer for a
> subpage object tracker that sits on top of the memcg page tracker -
> not unlike the page allocator and the slab allocators themselves.
> 
> And then the slab allocator would only interact with the subpage
> object tracker, and the object tracker would deal with the memcg page
> tracker under the hood.

Yes, the idea makes total sense to me. I'm not sure I like the new naming
(I have to spend some time with it first), but the idea of moving
stocks and leftovers to the memcg_ptr/obj_cgroup level is really good.

I'll include something based on your proposal into the next version
of the patchset.

Thank you!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-25 20:00     ` Roman Gushchin
@ 2019-10-25 20:52       ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2019-10-25 20:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Fri, Oct 25, 2019 at 08:00:32PM +0000, Roman Gushchin wrote:
> On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > On Thu, Oct 17, 2019 at 05:28:13PM -0700, Roman Gushchin wrote:
> > > +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> > > +						struct mem_cgroup **memcgp,
> > > +						size_t size, gfp_t flags)
> > > +{
> > > +	struct kmem_cache *cachep;
> > > +
> > > +	cachep = memcg_kmem_get_cache(s, memcgp);
> > > +	if (is_root_cache(cachep))
> > > +		return s;
> > > +
> > > +	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
> > > +		mem_cgroup_put(*memcgp);
> > > +		memcg_kmem_put_cache(cachep);
> > > +		cachep = NULL;
> > > +	}
> > > +
> > > +	return cachep;
> > > +}
> > > +
> > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >  					      struct mem_cgroup *memcg,
> > >  					      size_t size, void **p)
> > >  {
> > >  	struct mem_cgroup_ptr *memcg_ptr;
> > > +	struct lruvec *lruvec;
> > >  	struct page *page;
> > >  	unsigned long off;
> > >  	size_t i;
> > > @@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >  			off = obj_to_index(s, page, p[i]);
> > >  			mem_cgroup_ptr_get(memcg_ptr);
> > >  			page->mem_cgroup_vec[off] = memcg_ptr;
> > > +			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> > > +			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
> > > +					       s->size);
> > > +		} else {
> > > +			__memcg_kmem_uncharge_subpage(memcg, s->size);
> > >  		}
> > >  	}
> > >  	mem_cgroup_ptr_put(memcg_ptr);
> > 
> > The memcg_ptr as a collection vessel for object references makes a lot
> > of sense. But this code showcases that it should be a first-class
> > memory tracking API that the allocator interacts with, rather than
> > having to deal with a combination of memcg_ptr and memcg.
> > 
> > In the two hunks here, on one hand we charge bytes to the memcg
> > object, and then handle all the refcounting through a different
> > bucketing object. To support that in the first place, we have to
> > overload the memcg API all the way down to try_charge() to support
> > bytes and pages. This is difficult to follow throughout all layers.
> > 
> > What would be better is for this to be an abstraction layer for a
> > subpage object tracker that sits on top of the memcg page tracker -
> > not unlike the page allocator and the slab allocators themselves.
> > 
> > And then the slab allocator would only interact with the subpage
> > object tracker, and the object tracker would deal with the memcg page
> > tracker under the hood.
> 
> Yes, the idea makes total sense to me. I'm not sure I like the new naming
> (I have to spend some time with it first), but the idea of moving
> stocks and leftovers to the memcg_ptr/obj_cgroup level is really good.

I'm not set on the naming, it was just to illustrate the
structuring. I picked something that has cgroup in it, is not easily
confused with the memcg API, and shortens nicely to local variable
names (obj_cgroup -> objcg), but I'm all for a better name.

> I'll include something based on your proposal into the next version
> of the patchset.

Thanks, looking forward to it.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-25 19:41   ` Johannes Weiner
  2019-10-25 20:00     ` Roman Gushchin
@ 2019-10-31  1:52     ` Roman Gushchin
  2019-10-31 14:23       ` Johannes Weiner
  2019-10-31 14:41       ` Johannes Weiner
  1 sibling, 2 replies; 36+ messages in thread
From: Roman Gushchin @ 2019-10-31  1:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> On Thu, Oct 17, 2019 at 05:28:13PM -0700, Roman Gushchin wrote:
> > +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> > +						struct mem_cgroup **memcgp,
> > +						size_t size, gfp_t flags)
> > +{
> > +	struct kmem_cache *cachep;
> > +
> > +	cachep = memcg_kmem_get_cache(s, memcgp);
> > +	if (is_root_cache(cachep))
> > +		return s;
> > +
> > +	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
> > +		mem_cgroup_put(*memcgp);
> > +		memcg_kmem_put_cache(cachep);
> > +		cachep = NULL;
> > +	}
> > +
> > +	return cachep;
> > +}
> > +
> >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> >  					      struct mem_cgroup *memcg,
> >  					      size_t size, void **p)
> >  {
> >  	struct mem_cgroup_ptr *memcg_ptr;
> > +	struct lruvec *lruvec;
> >  	struct page *page;
> >  	unsigned long off;
> >  	size_t i;
> > @@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> >  			off = obj_to_index(s, page, p[i]);
> >  			mem_cgroup_ptr_get(memcg_ptr);
> >  			page->mem_cgroup_vec[off] = memcg_ptr;
> > +			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> > +			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
> > +					       s->size);
> > +		} else {
> > +			__memcg_kmem_uncharge_subpage(memcg, s->size);
> >  		}
> >  	}
> >  	mem_cgroup_ptr_put(memcg_ptr);
> 
> The memcg_ptr as a collection vessel for object references makes a lot
> of sense. But this code showcases that it should be a first-class
> memory tracking API that the allocator interacts with, rather than
> having to deal with a combination of memcg_ptr and memcg.
> 
> In the two hunks here, on one hand we charge bytes to the memcg
> object, and then handle all the refcounting through a different
> bucketing object. To support that in the first place, we have to
> overload the memcg API all the way down to try_charge() to support
> bytes and pages. This is difficult to follow throughout all layers.
> 
> What would be better is for this to be an abstraction layer for a
> subpage object tracker that sits on top of the memcg page tracker -
> not unlike the page allocator and the slab allocators themselves.
> 
> And then the slab allocator would only interact with the subpage
> object tracker, and the object tracker would deal with the memcg page
> tracker under the hood.

Hello, Johannes!

After some thought and coding I don't think it's the right direction to go.
By direction I mean building the new subpage/obj_cgroup API on top of the
existing memcg API, leaving the latter intact.
Let me illustrate my point with the code you provided (please look below).

> 
> Something like the below, just a rough sketch to convey the idea:
> 
> On allocation, we look up an obj_cgroup from the current task, charge
> it with obj_cgroup_charge() and bump its refcount by the number of
> objects we allocated. obj_cgroup_charge() in turn uses the memcg layer
> to charge pages, and then doles them out as bytes using
> objcg->nr_stocked_bytes and the byte per-cpu stock.
> 
> On freeing, we call obj_cgroup_uncharge() and drop the references. In
> turn, this first refills the byte per-cpu stock and drains the
> overflow to the memcg layer (which in turn refills its own page stock
> and drains to the page_counter).
> 
> On cgroup deletion, we reassign the obj_cgroup to the parent, and all
> remaining live objects will free their pages to the updated parental
> obj_cgroup->memcg, as per usual.
> 
> ---
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f36203cf75f8..efc3398aaf02 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,15 +200,15 @@ struct memcg_cgwb_frn {
>  };
>  
>  /*
> - * A pointer to a memory cgroup with a built-in reference counter.
> - * For a use as an intermediate object to simplify reparenting of
> - * objects charged to the cgroup. The memcg pointer can be switched
> - * to the parent cgroup without a need to modify all objects
> - * which hold the reference to the cgroup.
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
>   */
> -struct mem_cgroup_ptr {
> -	struct percpu_ref refcnt;
> +struct obj_cgroup {
> +	struct percpu_ref count;
>  	struct mem_cgroup *memcg;
> +	unsigned long nr_stocked_bytes;
>  	union {
>  		struct list_head list;
>  		struct rcu_head rcu;
> @@ -230,7 +230,6 @@ struct mem_cgroup {
>  	/* Accounted resources */
>  	struct page_counter memory;
>  	struct page_counter swap;
> -	atomic_t nr_stocked_bytes;
>  
>  	/* Legacy consumer-oriented counters */
>  	struct page_counter memsw;
> @@ -327,8 +326,8 @@ struct mem_cgroup {
>          /* Index in the kmem_cache->memcg_params.memcg_caches array */
>  	int kmemcg_id;
>  	enum memcg_kmem_state kmem_state;
> -	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
> -	struct list_head kmem_memcg_ptr_list;
> +	struct obj_cgroup __rcu *slab_objs;
> +	struct list_head slab_objs_list;
>  #endif
>  
>  	int last_scanned_node;
> @@ -474,21 +473,6 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>  	return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> -static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
> -{
> -	return percpu_ref_tryget(&ptr->refcnt);
> -}
> -
> -static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
> -{
> -	percpu_ref_get(&ptr->refcnt);
> -}
> -
> -static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
> -{
> -	percpu_ref_put(&ptr->refcnt);
> -}
> -
>  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  	if (memcg)
> @@ -1444,9 +1428,9 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>  			      struct mem_cgroup *memcg);
>  void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
>  				 unsigned int nr_pages);
> -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size,
>  				gfp_t gfp);
> -void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size);
> +void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
>  
>  extern struct static_key_false memcg_kmem_enabled_key;
>  extern struct workqueue_struct *memcg_kmem_cache_wq;
> @@ -1513,20 +1497,20 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
>  	return memcg ? memcg->kmemcg_id : -1;
>  }
>  
> -static inline struct mem_cgroup_ptr *
> -mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
> +static inline struct obj_cgroup *
> +mem_cgroup_get_slab_objs(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup_ptr *memcg_ptr;
> +	struct obj_cgroup *objcg;
>  
>  	rcu_read_lock();
>  	do {
> -		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
> -		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
> +		objcg = rcu_dereference(memcg->slab_objs);
> +		if (objcg && percpu_ref_tryget(&objcg->count))
>  			break;
>  	} while ((memcg = parent_mem_cgroup(memcg)));
>  	rcu_read_unlock();
>  
> -	return memcg_ptr;
> +	return objcg;
>  }
>  
>  #else
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4d99ee5a9c53..7f663fd53c17 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -200,7 +200,7 @@ struct page {
>  #ifdef CONFIG_MEMCG
>  	union {
>  		struct mem_cgroup *mem_cgroup;
> -		struct mem_cgroup_ptr **mem_cgroup_vec;
> +		struct obj_cgroup **obj_cgroups;
>  	};
>  #endif
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b0d0c833150c..47110f4f0f4c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -260,52 +260,51 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  #ifdef CONFIG_MEMCG_KMEM
>  extern spinlock_t css_set_lock;
>  
> -static void memcg_ptr_release(struct percpu_ref *ref)
> +static void obj_cgroup_release(struct percpu_ref *ref)
>  {
> -	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
> -						  refcnt);
> +	struct obj_cgroup *objcg;
>  	unsigned long flags;
>  
> +	objcg = container_of(ref, struct obj_cgroup, count);
> +
>  	spin_lock_irqsave(&css_set_lock, flags);
> -	list_del(&ptr->list);
> +	list_del(&objcg->list);
>  	spin_unlock_irqrestore(&css_set_lock, flags);
>  
> -	mem_cgroup_put(ptr->memcg);
> +	mem_cgroup_put(objcg->memcg);
>  	percpu_ref_exit(ref);
> -	kfree_rcu(ptr, rcu);
> +	kfree_rcu(objcg, rcu);
>  }

Btw, the part below is looking good, thanks for the idea!

>  
> -static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
> +static int obj_cgroup_alloc(struct mem_cgroup *memcg)
>  {
> -	struct mem_cgroup_ptr *kmem_memcg_ptr;
> +	struct obj_cgroup *objcg;
>  	int ret;
>  
> -	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
> -	if (!kmem_memcg_ptr)
> +	objcg = kmalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
> +	if (!objcg)
>  		return -ENOMEM;
>  
> -	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
> +	ret = percpu_ref_init(&objcg->count, obj_cgroup_release,
>  			      0, GFP_KERNEL);
>  	if (ret) {
> -		kfree(kmem_memcg_ptr);
> +		kfree(objcg);
>  		return ret;
>  	}
>  
> -	kmem_memcg_ptr->memcg = memcg;
> -	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
> -	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
> -	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
> +	INIT_LIST_HEAD(&objcg->list);
> +	objcg->memcg = memcg;
>  	return 0;
>  }
>  

< cut >

> @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
>  	css_put_many(&memcg->css, nr_pages);
>  }
>  
> -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> -				gfp_t gfp)
> +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
>  {
> -	return try_charge(memcg, gfp, size, true);
> +	int ret;
> +
> +	if (consume_obj_stock(objcg, nr_bytes))
> +		return 0;
> +
> +	ret = try_charge(objcg->memcg, gfp, 1);
> +	if (ret)
> +		return ret;

So, the first problem is here: try_charge() will bump the memcg reference.
It works for generic pages, where we set the mem_cgroup pointer, so that
the reference corresponds to the actual physical page.
But in this case, the actual page is shared between multiple cgroups.
Memcg will basically be referenced once for each stock cache miss.
I struggle to understand what that means and how to balance it
with the unreferencing on the free path.
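
To make it concrete, here is the charge path from your sketch again, with
the refcounting spelled out in the comments (illustration only, nothing new
here):

	int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
	{
		int ret;

		if (consume_obj_stock(objcg, size))
			return 0;

		/*
		 * Stock miss: try_charge() does css_get_many() for every
		 * page it charges (both on the direct and the done_restock
		 * path), although none of these pages will belong to this
		 * memcg exclusively.
		 */
		ret = try_charge(objcg->memcg, gfp, 1);
		if (ret)
			return ret;

		/*
		 * The leftover only goes into the objcg byte stock; it's
		 * not obvious which css_put() on the free path
		 * (refill_obj_stock() -> drain_obj_stock() -> refill_stock())
		 * is supposed to balance the references taken above.
		 */
		refill_obj_stock(objcg, PAGE_SIZE - size);
		return 0;
	}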

The second problem is also here. If a task belonging to a different memcg
is scheduled on this cpu, most likely we will need to refill both stocks,
even if we need only a small temporary allocation.

> +
> +	refill_obj_stock(objcg, PAGE_SIZE - size);

And the third problem is here. Percpu allocations (whose accounting I'm
working on right now) can be larger than a page.
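
E.g. (hypothetical numbers): for a percpu allocation of 2 * PAGE_SIZE + 512
bytes the sketched code does

	ret = try_charge(objcg->memcg, gfp, 1);     /* charges a single page */
	refill_obj_stock(objcg, PAGE_SIZE - size);  /* "leftover" wraps around */

i.e. it both undercharges the page counter and puts a bogus huge value into
the byte stock.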

This is a fairly small issue in comparison to the first one. But it
illustrates the main point well: we can't simply get a page from the
existing API and sublease it out in parts. The problem is that we need to
break the main principle that a page belongs to a single memcg.

Please let me know what you think about it.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-31  1:52     ` Roman Gushchin
@ 2019-10-31 14:23       ` Johannes Weiner
  2019-10-31 14:41       ` Johannes Weiner
  1 sibling, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2019-10-31 14:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu, Oct 31, 2019 at 01:52:44AM +0000, Roman Gushchin wrote:
> On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > On Thu, Oct 17, 2019 at 05:28:13PM -0700, Roman Gushchin wrote:
> > > +static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
> > > +						struct mem_cgroup **memcgp,
> > > +						size_t size, gfp_t flags)
> > > +{
> > > +	struct kmem_cache *cachep;
> > > +
> > > +	cachep = memcg_kmem_get_cache(s, memcgp);
> > > +	if (is_root_cache(cachep))
> > > +		return s;
> > > +
> > > +	if (__memcg_kmem_charge_subpage(*memcgp, size * s->size, flags)) {
> > > +		mem_cgroup_put(*memcgp);
> > > +		memcg_kmem_put_cache(cachep);
> > > +		cachep = NULL;
> > > +	}
> > > +
> > > +	return cachep;
> > > +}
> > > +
> > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >  					      struct mem_cgroup *memcg,
> > >  					      size_t size, void **p)
> > >  {
> > >  	struct mem_cgroup_ptr *memcg_ptr;
> > > +	struct lruvec *lruvec;
> > >  	struct page *page;
> > >  	unsigned long off;
> > >  	size_t i;
> > > @@ -439,6 +393,11 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >  			off = obj_to_index(s, page, p[i]);
> > >  			mem_cgroup_ptr_get(memcg_ptr);
> > >  			page->mem_cgroup_vec[off] = memcg_ptr;
> > > +			lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> > > +			mod_lruvec_memcg_state(lruvec, cache_vmstat_idx(s),
> > > +					       s->size);
> > > +		} else {
> > > +			__memcg_kmem_uncharge_subpage(memcg, s->size);
> > >  		}
> > >  	}
> > >  	mem_cgroup_ptr_put(memcg_ptr);
> > 
> > The memcg_ptr as a collection vessel for object references makes a lot
> > of sense. But this code showcases that it should be a first-class
> > memory tracking API that the allocator interacts with, rather than
> > having to deal with a combination of memcg_ptr and memcg.
> > 
> > In the two hunks here, on one hand we charge bytes to the memcg
> > object, and then handle all the refcounting through a different
> > bucketing object. To support that in the first place, we have to
> > overload the memcg API all the way down to try_charge() to support
> > bytes and pages. This is difficult to follow throughout all layers.
> > 
> > What would be better is for this to be an abstraction layer for a
> > subpage object tracker that sits on top of the memcg page tracker -
> > not unlike the page allocator and the slab allocators themselves.
> > 
> > And then the slab allocator would only interact with the subpage
> > object tracker, and the object tracker would deal with the memcg page
> > tracker under the hood.
> 
> Hello, Johannes!
> 
> After some thoughts and coding I don't think it's a right direction to go.
> Under direction I mean building the new subpage/obj_cgroup API on top
> of the intact existing memcg API.
> Let me illustrate my point on the code you provided (please, look below).
> 
> > 
> > Something like the below, just a rough sketch to convey the idea:
> > 
> > On allocation, we look up an obj_cgroup from the current task, charge
> > it with obj_cgroup_charge() and bump its refcount by the number of
> > objects we allocated. obj_cgroup_charge() in turn uses the memcg layer
> > to charge pages, and then doles them out as bytes using
> > objcg->nr_stocked_bytes and the byte per-cpu stock.
> > 
> > On freeing, we call obj_cgroup_uncharge() and drop the references. In
> > turn, this first refills the byte per-cpu stock and drains the
> > overflow to the memcg layer (which in turn refills its own page stock
> > and drains to the page_counter).
> > 
> > On cgroup deletion, we reassign the obj_cgroup to the parent, and all
> > remaining live objects will free their pages to the updated parental
> > obj_cgroup->memcg, as per usual.
> > 
> > ---
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f36203cf75f8..efc3398aaf02 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -200,15 +200,15 @@ struct memcg_cgwb_frn {
> >  };
> >  
> >  /*
> > - * A pointer to a memory cgroup with a built-in reference counter.
> > - * For a use as an intermediate object to simplify reparenting of
> > - * objects charged to the cgroup. The memcg pointer can be switched
> > - * to the parent cgroup without a need to modify all objects
> > - * which hold the reference to the cgroup.
> > + * Bucket for arbitrarily byte-sized objects charged to a memory
> > + * cgroup. The bucket can be reparented in one piece when the cgroup
> > + * is destroyed, without having to round up the individual references
> > + * of all live memory objects in the wild.
> >   */
> > -struct mem_cgroup_ptr {
> > -	struct percpu_ref refcnt;
> > +struct obj_cgroup {
> > +	struct percpu_ref count;
> >  	struct mem_cgroup *memcg;
> > +	unsigned long nr_stocked_bytes;
> >  	union {
> >  		struct list_head list;
> >  		struct rcu_head rcu;
> > @@ -230,7 +230,6 @@ struct mem_cgroup {
> >  	/* Accounted resources */
> >  	struct page_counter memory;
> >  	struct page_counter swap;
> > -	atomic_t nr_stocked_bytes;
> >  
> >  	/* Legacy consumer-oriented counters */
> >  	struct page_counter memsw;
> > @@ -327,8 +326,8 @@ struct mem_cgroup {
> >          /* Index in the kmem_cache->memcg_params.memcg_caches array */
> >  	int kmemcg_id;
> >  	enum memcg_kmem_state kmem_state;
> > -	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
> > -	struct list_head kmem_memcg_ptr_list;
> > +	struct obj_cgroup __rcu *slab_objs;
> > +	struct list_head slab_objs_list;
> >  #endif
> >  
> >  	int last_scanned_node;
> > @@ -474,21 +473,6 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> >  	return css ? container_of(css, struct mem_cgroup, css) : NULL;
> >  }
> >  
> > -static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
> > -{
> > -	return percpu_ref_tryget(&ptr->refcnt);
> > -}
> > -
> > -static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
> > -{
> > -	percpu_ref_get(&ptr->refcnt);
> > -}
> > -
> > -static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
> > -{
> > -	percpu_ref_put(&ptr->refcnt);
> > -}
> > -
> >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> >  {
> >  	if (memcg)
> > @@ -1444,9 +1428,9 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> >  			      struct mem_cgroup *memcg);
> >  void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
> >  				 unsigned int nr_pages);
> > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size,
> >  				gfp_t gfp);
> > -void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size);
> > +void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> >  
> >  extern struct static_key_false memcg_kmem_enabled_key;
> >  extern struct workqueue_struct *memcg_kmem_cache_wq;
> > @@ -1513,20 +1497,20 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
> >  	return memcg ? memcg->kmemcg_id : -1;
> >  }
> >  
> > -static inline struct mem_cgroup_ptr *
> > -mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
> > +static inline struct obj_cgroup *
> > +mem_cgroup_get_slab_objs(struct mem_cgroup *memcg)
> >  {
> > -	struct mem_cgroup_ptr *memcg_ptr;
> > +	struct obj_cgroup *objcg;
> >  
> >  	rcu_read_lock();
> >  	do {
> > -		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
> > -		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
> > +		objcg = rcu_dereference(memcg->slab_objs);
> > +		if (objcg && percpu_ref_tryget(&objcg->count))
> >  			break;
> >  	} while ((memcg = parent_mem_cgroup(memcg)));
> >  	rcu_read_unlock();
> >  
> > -	return memcg_ptr;
> > +	return objcg;
> >  }
> >  
> >  #else
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4d99ee5a9c53..7f663fd53c17 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -200,7 +200,7 @@ struct page {
> >  #ifdef CONFIG_MEMCG
> >  	union {
> >  		struct mem_cgroup *mem_cgroup;
> > -		struct mem_cgroup_ptr **mem_cgroup_vec;
> > +		struct obj_cgroup **obj_cgroups;
> >  	};
> >  #endif
> >  
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b0d0c833150c..47110f4f0f4c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -260,52 +260,51 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> >  #ifdef CONFIG_MEMCG_KMEM
> >  extern spinlock_t css_set_lock;
> >  
> > -static void memcg_ptr_release(struct percpu_ref *ref)
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> >  {
> > -	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
> > -						  refcnt);
> > +	struct obj_cgroup *objcg;
> >  	unsigned long flags;
> >  
> > +	objcg = container_of(ref, struct obj_cgroup, count);
> > +
> >  	spin_lock_irqsave(&css_set_lock, flags);
> > -	list_del(&ptr->list);
> > +	list_del(&objcg->list);
> >  	spin_unlock_irqrestore(&css_set_lock, flags);
> >  
> > -	mem_cgroup_put(ptr->memcg);
> > +	mem_cgroup_put(objcg->memcg);
> >  	percpu_ref_exit(ref);
> > -	kfree_rcu(ptr, rcu);
> > +	kfree_rcu(objcg, rcu);
> >  }
> 
> Btw, the part below is looking good, thanks for the idea!
> 
> >  
> > -static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
> > +static int obj_cgroup_alloc(struct mem_cgroup *memcg)
> >  {
> > -	struct mem_cgroup_ptr *kmem_memcg_ptr;
> > +	struct obj_cgroup *objcg;
> >  	int ret;
> >  
> > -	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
> > -	if (!kmem_memcg_ptr)
> > +	objcg = kmalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
> > +	if (!objcg)
> >  		return -ENOMEM;
> >  
> > -	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
> > +	ret = percpu_ref_init(&objcg->count, obj_cgroup_release,
> >  			      0, GFP_KERNEL);
> >  	if (ret) {
> > -		kfree(kmem_memcg_ptr);
> > +		kfree(objcg);
> >  		return ret;
> >  	}
> >  
> > -	kmem_memcg_ptr->memcg = memcg;
> > -	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
> > -	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
> > -	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
> > +	INIT_LIST_HEAD(&objcg->list);
> > +	objcg->memcg = memcg;
> >  	return 0;
> >  }
> >  
> 
> < cut >
> 
> > @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
> >  	css_put_many(&memcg->css, nr_pages);
> >  }
> >  
> > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > -				gfp_t gfp)
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
> >  {
> > -	return try_charge(memcg, gfp, size, true);
> > +	int ret;
> > +
> > +	if (consume_obj_stock(objcg, nr_bytes))
> > +		return 0;
> > +
> > +	ret = try_charge(objcg->memcg, gfp, 1);
> > +	if (ret)
> > +		return ret;
> 
> So, the first problem is here: try_charge() will bump the memcg reference.
> It works for generic pages, where we set the mem_cgroup pointer, so that
> the reference corresponds to the actual physical page.
> But in this case, the actual page is shared between multiple cgroups.
> Memcg will be basically referenced once for each stock cache miss.
> I struggle to understand what does it mean and how to balance it
> with unreferencing on the free path.

The charge functions make refcounting assumptions they shouldn't
make. That's a sign that we need to fix the refcounting, not work
around it in the upper layers.

We need the following patch for the new slab controller. It's an
overdue cleanup, too.

---
From 10bfa8e3111c65a2fdd2156547fe645c23847fa6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 31 Oct 2019 07:04:23 -0400
Subject: [PATCH] mm: memcontrol: decouple reference counting from page
 accounting

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's
new slab controller, which maintains pools of objects and doesn't want
to keep an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track
pointers to an object) is only for historical reasons: memcg used to
not take any css references and simply stalled offlining until all
charges had been reparented and the page counters had dropped to
zero. When we got rid of the reparenting requirement, the simple
mechanical translation was to take a reference for every charge.

More historical context can be found in e8ea14cc6ead ("mm: memcontrol:
take a css reference for each charged page"), 64f219938941 ("mm:
memcontrol: remove obsolete kmemcg pinning tricks") and b2052564e66d
("mm: memcontrol: continue cache reclaim from offlined groups").

The new slab controller exposes the limitations in this scheme, so
let's switch it to a more idiomatic reference counting model based on
actual kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg it's caching

- User pages hold a reference for their page->mem_cgroup. Transparent
  huge pages will no longer acquire tail references in advance; we'll
  get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- mem_cgroup_try_charge(), if successful, will return one reference to
  be consumed by page->mem_cgroup during commit, or put during cancel

- Pages allocated in the root cgroup will acquire and release css
  references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
  references; this change gets rid of that as well.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 45 ++++++++++++++++++++++++++-------------------
 mm/slab.h       |  2 --
 2 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7e3e758c165..b88c273d6dc0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,13 +2190,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
+	if (!old)
+		return;
+
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
-		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
+
+	css_put(&old->css);
 	stock->cached = NULL;
 }
 
@@ -2232,6 +2236,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
+		css_get(&memcg->css);
 		stock->cached = memcg;
 	}
 	stock->nr_pages += nr_pages;
@@ -2635,12 +2640,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2677,8 +2680,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
@@ -2989,6 +2990,7 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+			return 0;
 		}
 	}
 	css_put(&memcg->css);
@@ -3026,12 +3028,11 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge_memcg(memcg, nr_pages);
 	page->mem_cgroup = NULL;
+	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -3043,15 +3044,18 @@ void __memcg_kmem_uncharge(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = head->mem_cgroup;
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		css_get(&memcg->css);
+		head[i].mem_cgroup = memcg;
+	}
 
-	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
+	__mod_memcg_state(memcg, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -5485,7 +5489,9 @@ static int mem_cgroup_move_account(struct page *page,
 	 * uncharging, charging, migration, or LRU putback.
 	 */
 
-	/* caller should have done css_get */
+	css_get(&to->css);
+	css_put(&from->css);
+
 	page->mem_cgroup = to;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -6514,8 +6520,10 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		memcg = get_mem_cgroup_from_mm(mm);
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
-
-	css_put(&memcg->css);
+	if (ret) {
+		css_put(&memcg->css);
+		memcg = NULL;
+	}
 out:
 	*memcgp = memcg;
 	return ret;
@@ -6611,6 +6619,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		return;
 
 	cancel_charge(memcg, nr_pages);
+
+	css_put(&memcg->css);
 }
 
 struct uncharge_gather {
@@ -6652,9 +6662,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
-
-	if (!mem_cgroup_is_root(ug->memcg))
-		css_put_many(&ug->memcg->css, nr_pages);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6702,6 +6709,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->dummy_page = page;
 	page->mem_cgroup = NULL;
+	css_put(&ug->memcg->css);
 }
 
 static void uncharge_list(struct list_head *page_list)
@@ -6810,8 +6818,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
+	css_get(&memcg->css);
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
@@ -7059,8 +7067,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 				     -nr_entries);
 	memcg_check_events(memcg, page);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_entries);
+	css_put(&memcg->css);
 }
 
 /**
diff --git a/mm/slab.h b/mm/slab.h
index 2bbecf28688d..9f84298db2d7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -372,9 +372,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
 
-	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-	css_put_many(&memcg->css, 1 << order);
 out:
 	css_put(&memcg->css);
 	return ret;
-- 
2.23.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-31  1:52     ` Roman Gushchin
  2019-10-31 14:23       ` Johannes Weiner
@ 2019-10-31 14:41       ` Johannes Weiner
  2019-10-31 15:07         ` Roman Gushchin
  1 sibling, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2019-10-31 14:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu, Oct 31, 2019 at 01:52:44AM +0000, Roman Gushchin wrote:
> On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
> >  	css_put_many(&memcg->css, nr_pages);
> >  }
> >  
> > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > -				gfp_t gfp)
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
> >  {
> > -	return try_charge(memcg, gfp, size, true);
> > +	int ret;
> > +
> > +	if (consume_obj_stock(objcg, nr_bytes))
> > +		return 0;
> > +
> > +	ret = try_charge(objcg->memcg, gfp, 1);
> > +	if (ret)
> > +		return ret;

> The second problem is also here. If a task belonging to a different memcg
> is scheduled on this cpu, most likely we will need to refill both stocks,
> even if we need only a small temporarily allocation.

Yes, that's a good thing. The reason we have the per-cpu caches in the
first place is that the same cgroup will most likely perform
several allocations. Both the slab allocator and the page allocator
have per-cpu caches for the same reason. I don't really understand
what the argument is.

> > +
> > +	refill_obj_stock(objcg, PAGE_SIZE - size);
> 
> And the third problem is here. Percpu allocations (on which accounting I'm
> working right now) can be larger than a page.

How about this?

	nr_pages = round_up(size, PAGE_SIZE);
	try_charge(objcg->memcg, nr_pages);
	refill_obj_stock(objcg, size % PAGE_SIZE);

> This is fairly small issue in comparison to the first one. But it illustrates
> well the main point: we can't simple get a page from the existing API and
> sublease it in parts. The problem is that we need to break the main principle
> that a page belongs to a single memcg.

We can change the underlying assumptions of the existing API if they
are no longer correct. We don't have to invent a parallel stack.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-31 14:41       ` Johannes Weiner
@ 2019-10-31 15:07         ` Roman Gushchin
  2019-10-31 18:50           ` Johannes Weiner
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-10-31 15:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu, Oct 31, 2019 at 10:41:51AM -0400, Johannes Weiner wrote:
> On Thu, Oct 31, 2019 at 01:52:44AM +0000, Roman Gushchin wrote:
> > On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > > @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
> > >  	css_put_many(&memcg->css, nr_pages);
> > >  }
> > >  
> > > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > > -				gfp_t gfp)
> > > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
> > >  {
> > > -	return try_charge(memcg, gfp, size, true);
> > > +	int ret;
> > > +
> > > +	if (consume_obj_stock(objcg, nr_bytes))
> > > +		return 0;
> > > +
> > > +	ret = try_charge(objcg->memcg, gfp, 1);
> > > +	if (ret)
> > > +		return ret;
> 
> > The second problem is also here. If a task belonging to a different memcg
> > is scheduled on this cpu, we will most likely need to refill both stocks,
> > even if we need only a small temporary allocation.
> 
> Yes, that's a good thing. The reason we have the per-cpu caches in the
> first place is that the same cgroup will most likely perform several
> allocations in a row. Both the slab allocator and the page allocator
> have per-cpu caches for the same reason. I don't really understand
> what the argument is.

I mean it seems strange (and will most likely show up in perf numbers)
to move a page from one stock to another. Is there a reason why you want
to ask try_charge() for a single page and stock only that one page?

Can we do the following instead?

1) add a boolean argument to try_charge() to bypass the consume_stock() call
at the beginning and go straight to the slow path
2) use try_charge() with this argument set to true to fill the objcg/subpage
stock with MEMCG_CHARGE_BATCH pages

In this case we'll reuse try_charge() and get all the corresponding benefits,
but avoid the double stocking, which seems like a strange idea to me. A rough
sketch of what I mean is below.
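
Just to make the idea concrete (sketch only, the flag and the helper name
below are invented for illustration, not actual code):

	/* Sketch only: the skip_stock flag and the helper name are invented. */
	static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
			      unsigned int nr_pages, bool skip_stock)
	{
		if (!skip_stock && consume_stock(memcg, nr_pages))
			return 0;

		/* ... existing slow path: page_counter_try_charge(), reclaim ... */
		return 0;
	}

	/* Fill the per-cpu objcg stock with a whole batch in one go. */
	static int obj_stock_refill_from_counter(struct obj_cgroup *objcg, gfp_t gfp)
	{
		int ret = try_charge(objcg->memcg, gfp, MEMCG_CHARGE_BATCH, true);

		if (!ret)
			refill_obj_stock(objcg, MEMCG_CHARGE_BATCH * PAGE_SIZE);

		return ret;
	}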

> 
> > > +
> > > +	refill_obj_stock(objcg, PAGE_SIZE - size);
> > 
> > And the third problem is here. Percpu allocations (on which accounting I'm
> > working right now) can be larger than a page.
> 
> How about this?
> 
> 	nr_pages = round_up(size, PAGE_SIZE);
> 	try_charge(objcg->memcg, nr_pages);
> 	refill_obj_stock(objcg, size % PAGE_SIZE);

Ok, this will work.

> 
> > This is a fairly small issue in comparison to the first one. But it illustrates
> > the main point well: we can't simply get a page from the existing API and
> > sublease it in parts. The problem is that we need to break the main principle
> > that a page belongs to a single memcg.
> 
> We can change the underlying assumptions of the existing API if they
> are no longer correct. We don't have to invent a parallel stack.

Ok, this makes sense. And thank you for the patch, I'll fold it into the series.

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages
  2019-10-31 15:07         ` Roman Gushchin
@ 2019-10-31 18:50           ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2019-10-31 18:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu, Oct 31, 2019 at 03:07:02PM +0000, Roman Gushchin wrote:
> On Thu, Oct 31, 2019 at 10:41:51AM -0400, Johannes Weiner wrote:
> > On Thu, Oct 31, 2019 at 01:52:44AM +0000, Roman Gushchin wrote:
> > > On Fri, Oct 25, 2019 at 03:41:18PM -0400, Johannes Weiner wrote:
> > > > @@ -3117,15 +3095,24 @@ void __memcg_kmem_uncharge(struct page *page, int order)
> > > >  	css_put_many(&memcg->css, nr_pages);
> > > >  }
> > > >  
> > > > -int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
> > > > -				gfp_t gfp)
> > > > +int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size, gfp_t gfp)
> > > >  {
> > > > -	return try_charge(memcg, gfp, size, true);
> > > > +	int ret;
> > > > +
> > > > +	if (consume_obj_stock(objcg, size))
> > > > +		return 0;
> > > > +
> > > > +	ret = try_charge(objcg->memcg, gfp, 1);
> > > > +	if (ret)
> > > > +		return ret;
> > 
> > > The second problem is also here. If a task belonging to a different memcg
> > > is scheduled on this cpu, we will most likely need to refill both stocks,
> > > even if we need only a small temporary allocation.
> > 
> > Yes, that's a good thing. The reason we have the per-cpu caches in the
> > first place is that the same cgroup will most likely perform several
> > allocations in a row. Both the slab allocator and the page allocator
> > have per-cpu caches for the same reason. I don't really understand
> > what the argument is.
> 
> I mean it seems strange (and will most likely show up in perf numbers)
> to move a page from one stock to another. Is there a reason why you want
> to ask try_charge() for a single page and stock only that one page?
>
> Can we do the following instead?
> 
> 1) add a boolean argument to try_charge() to bypass the consume_stock() call
> at the beginning and go straight to the slow path
> 2) use try_charge() with this argument set to true to fill the objcg/subpage
> stock with MEMCG_CHARGE_BATCH pages

No, think this through.

If you have disjunct caches for the page_counter, it means the cache
work cannot be shared. A slab allocation has to hit the page_counter,
and a subsequent page allocation has to hit it again; likewise, a slab
allocation cannot benefit from the caching of prior page allocations.

You're trading cheap, unlocked, cpu-local subtractions against costly
atomic RMW ops on shared cachelines. You also double the amount of
cached per-cpu memory and introduce a layering violation.

Hotpath (bytes cached)

	stacked:				disjunct:
	consume_subpage_stock()			try_charge()
						  consume_subpage_stock()

Warmpath (pages cached)

	stacked:				disjunct:
	consume_subpage_stock()			try_charge()
	  try_charge()				  consume_subpage_stock()
	    consume_stock()			  page_counter_charge()
	refill_subpage_stock()			  refill_subpage_stock()

Coldpath (nothing cached)

	stacked:				disjunct:
	consume_subpage_stock()			try_charge()
	  try_charge()				  consume_subpage_stock()
	    consume_stock()			  page_counter_charge()
	    page_counter_charge()		  refill_subpage_stock()
	    refill_stock()
	refill_subpage_stock()
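
For completeness, the cpu-local step in the stacked hot path is roughly
this shape (a sketch, not the actual patch; consume_subpage_stock() in the
diagram and consume_obj_stock() in the hunk upthread refer to the same
per-cpu byte stock, details may differ in the real code):

	/* Sketch of the per-cpu byte stock behind the 'stacked' hot path. */
	struct obj_stock {
		struct obj_cgroup *cached_objcg;
		unsigned int nr_bytes;	/* pre-charged bytes, always < PAGE_SIZE */
	};
	static DEFINE_PER_CPU(struct obj_stock, obj_stock);

	static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
	{
		struct obj_stock *stock;
		unsigned long flags;
		bool ret = false;

		local_irq_save(flags);

		stock = this_cpu_ptr(&obj_stock);
		if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
			stock->nr_bytes -= nr_bytes;	/* no atomics, no shared cachelines */
			ret = true;
		}

		local_irq_restore(flags);

		return ret;
	}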

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2019-10-31 18:50 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-18  0:28 [PATCH 00/16] The new slab memory controller Roman Gushchin
2019-10-18  0:28 ` [PATCH 01/16] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
2019-10-18  0:28 ` [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
2019-10-20 22:44   ` Christopher Lameter
2019-10-21  1:15     ` Roman Gushchin
2019-10-21 18:09       ` Christopher Lameter
2019-10-20 22:51   ` Christopher Lameter
2019-10-21  1:21     ` Roman Gushchin
2019-10-18  0:28 ` [PATCH 03/16] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
2019-10-18  0:28 ` [PATCH 04/16] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
2019-10-18  0:28 ` [PATCH 05/16] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2019-10-18  0:28 ` [PATCH 06/16] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
2019-10-18  0:28 ` [PATCH 07/16] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2019-10-18  0:28 ` [PATCH 08/16] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
2019-10-18  0:28 ` [PATCH 09/16] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
2019-10-25 19:41   ` Johannes Weiner
2019-10-25 20:00     ` Roman Gushchin
2019-10-25 20:52       ` Johannes Weiner
2019-10-31  1:52     ` Roman Gushchin
2019-10-31 14:23       ` Johannes Weiner
2019-10-31 14:41       ` Johannes Weiner
2019-10-31 15:07         ` Roman Gushchin
2019-10-31 18:50           ` Johannes Weiner
2019-10-18  0:28 ` [PATCH 10/16] mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h Roman Gushchin
2019-10-18  0:28 ` [PATCH 11/16] mm: memcg/slab: replace memcg_from_slab_page() with memcg_from_slab_obj() Roman Gushchin
2019-10-18  0:28 ` [PATCH 13/16] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
2019-10-18  0:28 ` [PATCH 14/16] mm: memcg/slab: use one set of kmem_caches for all memory cgroups Roman Gushchin
2019-10-18  0:28 ` [PATCH 15/16] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
2019-10-18  0:28 ` [PATCH 16/16] mm: slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
2019-10-18 17:03 ` [PATCH 00/16] The new slab memory controller Waiman Long
2019-10-18 17:12   ` Roman Gushchin
2019-10-22 13:22 ` Michal Hocko
2019-10-22 13:28   ` Michal Hocko
2019-10-22 15:48     ` Roman Gushchin
2019-10-22 13:31 ` Michal Hocko
2019-10-22 15:59   ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).