linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 00/14] The new slab memory controller
@ 2019-09-05 21:45 Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

The existing slab memory controller is based on the idea of replicating
slab allocator internals for each memory cgroup. This approach promises
a low memory overhead (one pointer per page) and doesn't add much code
to the hot allocation and release paths. But it has a very serious flaw:
it leads to low slab utilization.

Using a drgn* script, I've estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings it into the high 90s and gives back
30-50% of slab memory. It means that the real price of the existing
slab memory controller is much bigger than a pointer per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If only a few allocations of a certain size are made by a cgroup,
or if some active objects (e.g. dentries) are left after the cgroup is
deleted, or if the cgroup contains a single-threaded application which
barely allocates any kernel objects, but does so every time on a new
CPU: in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel is able to use the free space
on slab pages for other allocations.

Arguably it wasn't an issue back in the days when the kmem controller
was introduced as an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory
controller, which aims to reach a much better slab utilization by
sharing slab pages between multiple memory cgroups. Below is a short
description of the new design (more details in the commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is performed on a
per-page basis, with rounding up and remembering leftovers.
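
As a simplified illustration of the rounding scheme (a sketch only, not
patchset code; a 4096-byte page is assumed, and the actual logic lives
in the try_charge() changes of patch 01):

    unsigned int nr_bytes = 700;                  /* size of the charged object */
    unsigned int nr_pages = (nr_bytes >> PAGE_SHIFT) + 1;         /* 1 page */
    unsigned int leftover = (nr_pages << PAGE_SHIFT) - nr_bytes;  /* 3396 bytes */
    /* the leftover stays in a per-cpu stock and covers future subpage charges */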

Memcg ownership data is stored in a per-slab-page vector: for each slab
page a vector of the corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of saving a
pointer to the memory cgroup directly. It's simply a pointer to a memcg
(which can easily be changed to the parent) with a built-in reference
counter. This scheme allows reparenting all allocated objects without
walking over them and changing the memcg pointer to the parent.
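
A rough conceptual sketch of the indirection (the real code is
introduced in patch 02; the helper name here is made up):

    /*
     * Slab objects keep a reference to a mem_cgroup_ptr rather than to
     * the memcg itself, so reparenting is a single pointer switch.
     */
    static void reparent_sketch(struct mem_cgroup_ptr *ptr,
                                struct mem_cgroup *parent)
    {
            xchg(&ptr->memcg, parent);
    }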

Instead of creating an individual set of kmem_caches for each memory
cgroup, two global sets are used: the root set for non-accounted and
root-cgroup allocations, and a second set for all other allocations.
This simplifies the lifetime management of individual kmem_caches: they
are destroyed together with their root counterparts. It also allows
removing a good amount of code and makes things generally simpler.

The patchset contains a couple of semi-independent parts, which can find
usage outside of the slab memory controller too:
1) the subpage charging API, which can be used in the future for
   accounting of other non-page-sized objects, e.g. percpu allocations;
2) the mem_cgroup_ptr API (refcounted pointers to a memcg), which can be
   reused for the efficient reparenting of other objects, e.g. pagecache.

The patchset has been tested on a number of different workloads in our
production. In all cases, it saved hefty amounts of memory:
1) web frontend, 650-700 MB, ~42% of slab memory
2) database cache, 750-800 MB, ~35% of slab memory
3) dns server, 700 MB, ~36% of slab memory

So far I haven't found any regression on the tested workloads, but a
potential CPU regression caused by the more precise accounting is a
concern.

Obviously the amount of saved memory depends on the number of memory
cgroups, uptime and the specific workloads, but overall it feels like
the new controller saves 30-40% of slab memory, sometimes more.
Additionally, it should lead to lower memory fragmentation, both because
of a smaller number of non-movable pages and because there is no longer
a need to move all slab objects to a new set of pages when a workload is
restarted in a new memory cgroup.

* https://github.com/osandov/drgn


Roman Gushchin (14):
  mm: memcg: subpage charging API
  mm: memcg: introduce mem_cgroup_ptr
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: allocate space for memcg ownership data for non-root
    slabs
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: save memcg ownership data for non-root slab objects
  mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg: introduce __mod_lruvec_memcg_state()
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
  mm: memcg/slab: replace memcg_from_slab_page() with
    memcg_from_slab_obj()
  mm: memcg/slab: use one set of kmem_caches for all memory cgroups
  mm: slab: remove redundant check in memcg_accumulate_slabinfo()

 drivers/base/node.c        |  11 +-
 fs/proc/meminfo.c          |   4 +-
 include/linux/memcontrol.h | 102 ++++++++-
 include/linux/mm_types.h   |   5 +-
 include/linux/mmzone.h     |  12 +-
 include/linux/slab.h       |   3 +-
 include/linux/slub_def.h   |   9 +
 include/linux/vmstat.h     |   8 +
 kernel/power/snapshot.c    |   2 +-
 mm/list_lru.c              |  12 +-
 mm/memcontrol.c            | 431 +++++++++++++++++++++--------------
 mm/oom_kill.c              |   2 +-
 mm/page_alloc.c            |   8 +-
 mm/slab.c                  |  37 ++-
 mm/slab.h                  | 300 +++++++++++++------------
 mm/slab_common.c           | 449 ++++---------------------------------
 mm/slob.c                  |  12 +-
 mm/slub.c                  |  63 ++----
 mm/vmscan.c                |   3 +-
 mm/vmstat.c                |  38 +++-
 mm/workingset.c            |   6 +-
 21 files changed, 683 insertions(+), 834 deletions(-)

-- 
2.21.0



* [PATCH RFC 01/14] mm: memcg: subpage charging API
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-16 12:56   ` Johannes Weiner
  2019-09-05 21:45 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Introduce an API to charge subpage objects to the memory cgroup.
The API will be used by the new slab memory controller. Later it
can also be used to implement percpu memory accounting.
In both cases, a single page can be shared between multiple cgroups
(and in the percpu case a single allocation is split over multiple
pages), so it's not possible to use page-based accounting.

The implementation is based on percpu stocks. Memory cgroups are still
charged in pages, and the residue is stored in the percpu stock, or on
the memcg itself when it's necessary to flush the stock.

Please note that, unlike the generic page charging API, a subpage
object does not hold a reference to the memory cgroup, because a more
complicated indirect scheme is required in order to implement cheap
reparenting. The percpu stock does hold a single reference to the
cached cgroup though.
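
A minimal usage sketch of the new API (a hypothetical caller, for
illustration only; the real user is the slab controller introduced
later in the series):

    struct mem_cgroup *memcg = get_mem_cgroup_from_current();

    if (!__memcg_kmem_charge_subpage(memcg, 128, GFP_KERNEL)) {
            /* 128 bytes are now charged to memcg */
            __memcg_kmem_uncharge_subpage(memcg, 128);
    }
    css_put(&memcg->css);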

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   4 ++
 mm/memcontrol.c            | 129 +++++++++++++++++++++++++++++++++----
 2 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c762e8ca6a6..120d39066148 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -214,6 +214,7 @@ struct mem_cgroup {
 	/* Accounted resources */
 	struct page_counter memory;
 	struct page_counter swap;
+	atomic_t nr_stocked_bytes;
 
 	/* Legacy consumer-oriented counters */
 	struct page_counter memsw;
@@ -1370,6 +1371,9 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 			      struct mem_cgroup *memcg);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
+int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
+				gfp_t gfp);
+void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c4c08b45e44..effefcec47b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2149,6 +2149,10 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+	struct mem_cgroup *subpage_cached;
+	unsigned int nr_bytes;
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2189,6 +2193,29 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	return ret;
 }
 
+static bool consume_subpage_stock(struct mem_cgroup *memcg,
+				  unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	if (nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
+		return ret;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (memcg == stock->subpage_cached && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
 /*
  * Returns stocks cached in percpu and reset cached information.
  */
@@ -2206,6 +2233,27 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 	stock->cached = NULL;
 }
 
+static void drain_subpage_stock(struct memcg_stock_pcp *stock)
+{
+	struct mem_cgroup *old = stock->subpage_cached;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		page_counter_uncharge(&old->memory, nr_pages);
+		if (do_memsw_account())
+			page_counter_uncharge(&old->memsw, nr_pages);
+
+		atomic_add(nr_bytes, &old->nr_stocked_bytes);
+		stock->nr_bytes = 0;
+	}
+	if (stock->subpage_cached) {
+		css_put(&old->css);
+		stock->subpage_cached = NULL;
+	}
+}
+
 static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock;
@@ -2218,8 +2266,11 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
+		drain_stock(stock);
+		drain_subpage_stock(stock);
+		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	}
 
 	local_irq_restore(flags);
 }
@@ -2248,6 +2299,29 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	local_irq_restore(flags);
 }
 
+static void refill_subpage_stock(struct mem_cgroup *memcg,
+				 unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->subpage_cached != memcg) { /* reset if necessary */
+		drain_subpage_stock(stock);
+		css_get(&memcg->css);
+		stock->subpage_cached = memcg;
+		stock->nr_bytes = atomic_xchg(&memcg->nr_stocked_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > (MEMCG_CHARGE_BATCH << PAGE_SHIFT))
+		drain_subpage_stock(stock);
+
+	local_irq_restore(flags);
+}
+
 /*
  * Drains all per-CPU charge caches for given root_memcg resp. subtree
  * of the hierarchy under it.
@@ -2276,6 +2350,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		memcg = stock->subpage_cached;
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
 }
 
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-		      unsigned int nr_pages)
+		      unsigned int amount, bool subpage)
 {
+	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
@@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (subpage && consume_subpage_stock(memcg, amount))
+		return 0;
+	else if (!subpage && consume_stock(memcg, nr_pages))
 		return 0;
 
 	if (!do_memsw_account() ||
@@ -2632,14 +2712,22 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
+
+	if (subpage)
+		refill_subpage_stock(memcg, (nr_pages << PAGE_SHIFT) - amount);
+	else
+		css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
+	if (subpage && (batch << PAGE_SHIFT) > amount) {
+		refill_subpage_stock(memcg, (batch << PAGE_SHIFT) - amount);
+	} else if (!subpage) {
+		css_get_many(&memcg->css, batch);
+		if (batch > nr_pages)
+			refill_stock(memcg, batch - nr_pages);
+	}
 
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
@@ -2942,7 +3030,7 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 	struct page_counter *counter;
 	int ret;
 
-	ret = try_charge(memcg, gfp, nr_pages);
+	ret = try_charge(memcg, gfp, nr_pages, false);
 	if (ret)
 		return ret;
 
@@ -3020,6 +3108,18 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 	css_put_many(&memcg->css, nr_pages);
 }
+
+int __memcg_kmem_charge_subpage(struct mem_cgroup *memcg, size_t size,
+				gfp_t gfp)
+{
+	return try_charge(memcg, gfp, size, true);
+}
+
+void __memcg_kmem_uncharge_subpage(struct mem_cgroup *memcg, size_t size)
+{
+	refill_subpage_stock(memcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -5267,7 +5367,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
 	int ret;
 
 	/* Try a single bulk charge without reclaim first, kswapd may wake */
-	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count,
+			 false);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
@@ -5275,7 +5376,7 @@ static int mem_cgroup_do_precharge(unsigned long count)
 
 	/* Try charges one by one with reclaim, but do not retry */
 	while (count--) {
-		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
+		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1, false);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -6487,7 +6588,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
-	ret = try_charge(memcg, gfp_mask, nr_pages);
+	ret = try_charge(memcg, gfp_mask, nr_pages, false);
 
 	css_put(&memcg->css);
 out:
@@ -6866,10 +6967,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
 
-	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
+	if (try_charge(memcg, gfp_mask, nr_pages, false) == 0)
 		return true;
 
-	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
+	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages, false);
 	return false;
 }
 
-- 
2.21.0



* [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 22:34   ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 03/14] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

This commit introduces the mem_cgroup_ptr structure and a corresponding
API. It implements a pointer to a memory cgroup with a built-in
reference counter. Its main goal is to implement reparenting
efficiently.

If a number of objects (e.g. slab pages) have to keep a pointer and
a reference to a memory cgroup, they can use mem_cgroup_ptr instead.
On reparenting, only the mem_cgroup_ptr->memcg pointer has to be
changed, instead of walking over all accounted objects.

mem_cgroup_ptr holds a single reference to the corresponding memory
cgroup. Because it's initialized before the css reference counter,
the css refcounter can't be bumped at allocation time. Instead, it's
bumped on reparenting, which happens during offlining. A cgroup is
never released while online, so this is fine.

mem_cgroup_ptr is released using RCU, so memcg->kmem_memcg_ptr can
be accessed in an RCU read section. On reparenting it's atomically
switched to NULL. If the reader gets NULL, it can just read the
parent's kmem_memcg_ptr instead.

Each memory cgroup contains a list of kmem_memcg_ptrs. On reparenting
the list is spliced into the parent's list. The list is protected by
the css_set lock.
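
A short sketch of the intended usage (illustrative only, assuming a
valid memcg; the real user is the slab controller later in the series):

    struct mem_cgroup_ptr *memcg_ptr;

    /* take a reference to the current kmem_memcg_ptr of memcg */
    memcg_ptr = mem_cgroup_get_kmem_ptr(memcg);

    /* memcg_ptr->memcg may be switched to the parent at any time */

    mem_cgroup_ptr_put(memcg_ptr);    /* drop the reference */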

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 50 ++++++++++++++++++++++
 mm/memcontrol.c            | 87 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 120d39066148..dd5ebfe5a86c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct mem_cgroup_ptr;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -197,6 +198,22 @@ struct memcg_cgwb_frn {
 	int memcg_id;			/* memcg->css.id of foreign inode */
 	u64 at;				/* jiffies_64 at the time of dirtying */
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
+};
+
+/*
+ * A pointer to a memory cgroup with a built-in reference counter.
+ * For a use as an intermediate object to simplify reparenting of
+ * objects charged to the cgroup. The memcg pointer can be switched
+ * to the parent cgroup without a need to modify all objects
+ * which hold the reference to the cgroup.
+ */
+struct mem_cgroup_ptr {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
 };
 
 /*
@@ -312,6 +329,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
+	struct list_head kmem_memcg_ptr_list;
 #endif
 
 	int last_scanned_node;
@@ -440,6 +459,21 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
+{
+	return percpu_ref_tryget(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_get(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_put(&ptr->refcnt);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1433,6 +1467,22 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+static inline struct mem_cgroup_ptr *
+mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+
+	rcu_read_lock();
+	do {
+		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
+		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
+			break;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+
+	return memcg_ptr;
+}
+
 #else
 
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index effefcec47b3..cb9adb31360e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -266,6 +266,77 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void memcg_ptr_release(struct percpu_ref *ref)
+{
+	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
+						  refcnt);
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	list_del(&ptr->list);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	mem_cgroup_put(ptr->memcg);
+	percpu_ref_exit(ref);
+	kfree_rcu(ptr, rcu);
+}
+
+static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *kmem_memcg_ptr;
+	int ret;
+
+	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
+	if (!kmem_memcg_ptr)
+		return -ENOMEM;
+
+	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
+			      0, GFP_KERNEL);
+	if (ret) {
+		kfree(kmem_memcg_ptr);
+		return ret;
+	}
+
+	kmem_memcg_ptr->memcg = memcg;
+	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
+	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
+	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
+	return 0;
+}
+
+static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
+					  struct mem_cgroup *parent)
+{
+	unsigned int nr_reparented = 0;
+	struct mem_cgroup_ptr *memcg_ptr = NULL;
+
+	rcu_swap_protected(memcg->kmem_memcg_ptr, memcg_ptr, true);
+	percpu_ref_kill(&memcg_ptr->refcnt);
+
+	/*
+	 * kmem_memcg_ptr is initialized before css refcounter, so until now
+	 * it doesn't hold a reference to the memcg. Bump it here.
+	 */
+	css_get(&memcg->css);
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(memcg_ptr, &memcg->kmem_memcg_ptr_list, list) {
+		xchg(&memcg_ptr->memcg, parent);
+		nr_reparented++;
+	}
+	if (nr_reparented)
+		list_splice(&memcg->kmem_memcg_ptr_list,
+			    &parent->kmem_memcg_ptr_list);
+	spin_unlock_irq(&css_set_lock);
+
+	if (nr_reparented) {
+		css_get_many(&parent->css, nr_reparented);
+		css_put_many(&memcg->css, nr_reparented);
+	}
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -3554,7 +3625,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	int memcg_id;
+	int memcg_id, ret;
 
 	if (cgroup_memory_nokmem)
 		return 0;
@@ -3566,6 +3637,12 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	ret = memcg_init_kmem_memcg_ptr(memcg);
+	if (ret) {
+		memcg_free_cache_id(memcg_id);
+		return ret;
+	}
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3601,12 +3678,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches. Then flush percpu
-	 * slab statistics to have precise values at the parent and
-	 * all ancestor levels. It's required to keep slab stats
+	 * Deactivate and reparent kmem_caches and reparent kmem_memcg_ptr.
+	 * Then flush percpu slab statistics to have precise values at the
+	 * parent and all ancestor levels. It's required to keep slab stats
 	 * accurate after the reparenting of kmem_caches.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_kmem_memcg_ptr(memcg, parent);
 	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -5171,6 +5249,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->kmem_memcg_ptr_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.21.0



* [PATCH RFC 03/14] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Currently the s8 type is used for per-cpu caching of per-node
statistics. It works fine because the overfill threshold can't exceed
125.

But if some counters are in bytes (and the next commit in the series
will convert slab counters to bytes), it's not going to work: a value
in bytes can easily exceed the s8 range without exceeding the threshold
converted to bytes. So to avoid overfilling per-cpu caches and breaking
vmstat correctness, let's use s32 instead.

This doesn't affect per-zone statistics. There are no plans to use
zone-level byte-sized counters, so there is no reason to change
anything.
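
For example, assuming a 4 KB page size: with the maximum threshold of
125 pages, a byte-sized per-cpu diff can legitimately reach about
125 * 4096 = 512000 before being flushed, which is far outside the s8
range of [-128, 127] but fits comfortably into s32.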

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bda20282746b..6e8233d52971 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -353,7 +353,7 @@ struct per_cpu_pageset {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..98f43725d910 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -337,7 +337,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -395,13 +395,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -439,13 +439,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -538,7 +538,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
        enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	do {
-- 
2.21.0



* [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (2 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 03/14] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-16 12:38   ` Johannes Weiner
  2019-09-05 21:45 ` [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

In order to prepare for per-object slab memory accounting, convert the
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it clear that these vmstats are in bytes, rename them to
NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
NR_KERNEL_STACK_KB).

The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
will fit into the atomic_long_t we use for vmstats.
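
After this change, the slab counters can be read either raw (in bytes)
or through the new helpers converting back to pages; a short sketch:

    /* raw value, now in bytes for the two slab items */
    unsigned long bytes = global_node_page_state(NR_SLAB_RECLAIMABLE_B);

    /* the _pages helper shifts the value back to pages for old users */
    unsigned long pages = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);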

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/base/node.c     | 11 ++++++++---
 fs/proc/meminfo.c       |  4 ++--
 include/linux/mmzone.h  | 10 ++++++++--
 include/linux/vmstat.h  |  8 ++++++++
 kernel/power/snapshot.c |  2 +-
 mm/memcontrol.c         | 29 ++++++++++++++++-------------
 mm/oom_kill.c           |  2 +-
 mm/page_alloc.c         |  8 ++++----
 mm/slab.h               | 15 ++++++++-------
 mm/slab_common.c        |  4 ++--
 mm/slob.c               | 12 ++++++------
 mm/slub.c               |  8 ++++----
 mm/vmscan.c             |  3 ++-
 mm/vmstat.c             | 22 +++++++++++++++++++---
 mm/workingset.c         |  6 ++++--
 15 files changed, 93 insertions(+), 51 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 296546ffed6c..56664222f3fd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
-	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
-	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+	sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
@@ -495,9 +495,14 @@ static ssize_t node_read_vmstat(struct device *dev,
 	int i;
 	int n = 0;
 
-	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+		unsigned long x = sum_zone_node_page_state(nid, i);
+
+		if (vmstat_item_in_bytes(i))
+			x >>= PAGE_SHIFT;
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
 			     sum_zone_node_page_state(nid, i));
+	}
 
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index ac9247371871..87afa8683c1b 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -53,8 +53,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
-	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
-	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e8233d52971..2dbc2d042ef6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,8 +215,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,	/* Please do not reorder this item */
-	NR_SLAB_UNRECLAIMABLE,	/* and this one without looking at
+	NR_SLAB_RECLAIMABLE_B,	/* Please do not reorder this item */
+	NR_SLAB_UNRECLAIMABLE_B,/* and this one without looking at
 				 * memcg_flush_percpu_vmstats() first. */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
@@ -247,6 +247,12 @@ enum node_stat_item {
 	NR_VM_NODE_STAT_ITEMS
 };
 
+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+	return (item == NR_SLAB_RECLAIMABLE_B ||
+		item == NR_SLAB_UNRECLAIMABLE_B);
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bdeda4b079fe..e5460e597e0c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -194,6 +194,12 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return x;
 }
 
+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
+{
+	return global_node_page_state(item) >> PAGE_SHIFT;
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -234,6 +240,8 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+					   enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 83105874f255..ce9e5686e745 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1659,7 +1659,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
 {
 	unsigned long size;
 
-	size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+	size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
 		+ global_node_page_state(NR_ACTIVE_ANON)
 		+ global_node_page_state(NR_INACTIVE_ANON)
 		+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb9adb31360e..e4af9810b59e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -759,13 +759,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	if (mem_cgroup_disabled())
 		return;
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup *mi;
 
 		/*
@@ -807,7 +810,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	/* Update node */
 	__mod_node_page_state(pgdat, idx, val);
@@ -824,8 +827,11 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup_per_node *pi;
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -1478,9 +1484,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
 		       1024);
 	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
-		       PAGE_SIZE);
+		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
@@ -1514,11 +1519,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 			       PAGE_SIZE);
 
 	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
 
 	/* Accumulated memory events */
 
@@ -3564,8 +3567,8 @@ static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
 	int min_idx, max_idx;
 
 	if (slab_only) {
-		min_idx = NR_SLAB_RECLAIMABLE;
-		max_idx = NR_SLAB_UNRECLAIMABLE;
+		min_idx = NR_SLAB_RECLAIMABLE_B;
+		max_idx = NR_SLAB_UNRECLAIMABLE_B;
 	} else {
 		min_idx = 0;
 		max_idx = MEMCG_NR_STAT;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314ce1a3cf25..61476bf0f147 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -183,7 +183,7 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
-	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5d62f1c2851..9da8ee92c226 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5100,8 +5100,8 @@ long si_mem_available(void)
 	 * items that are in use, and cannot be freed. Cap this estimate at the
 	 * low watermark.
 	 */
-	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
-			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
@@ -5245,8 +5245,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
-		global_node_page_state(NR_SLAB_RECLAIMABLE),
-		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index 68e455f2b698..7c5577c2b9ea 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 static inline int cache_vmstat_idx(struct kmem_cache *s)
 {
 	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+		NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -360,7 +360,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    (1 << order));
+				    (PAGE_SIZE << order));
 		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 		return 0;
 	}
@@ -370,7 +370,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), PAGE_SIZE << order);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
@@ -394,11 +394,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+				 -(PAGE_SIZE << order));
 		memcg_kmem_uncharge_memcg(page, order, memcg);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 	}
 	rcu_read_unlock();
 
@@ -482,7 +483,7 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    1 << order);
+				    PAGE_SIZE << order);
 		return 0;
 	}
 
@@ -494,7 +495,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 		return;
 	}
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index c29f03adca91..79695d9c34f3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1303,8 +1303,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	page = alloc_pages(flags, order);
 	if (likely(page)) {
 		ret = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 	ret = kasan_kmalloc_large(ret, size, flags);
 	/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index fa53e9f73893..8b7b56235438 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
-	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-			    1 << order);
+	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+			    PAGE_SIZE << order);
 	return page_address(page);
 }
 
@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
 
-	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-			    -(1 << order));
+	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
 }
 
@@ -550,8 +550,8 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else {
 		unsigned int order = compound_order(sp);
-		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(sp, order);
 
 	}
diff --git a/mm/slub.c b/mm/slub.c
index c9856a9807f1..0873b77727bf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3825,8 +3825,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	page = alloc_pages_node(node, flags, order);
 	if (page) {
 		ptr = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 
 	return kmalloc_large_node_hook(ptr, size, flags);
@@ -3957,8 +3957,8 @@ void kfree(const void *x)
 
 		BUG_ON(!PageCompound(page));
 		kfree_hook(object);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(page, order);
 		return;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee47bbcb99b5..283da9a39de2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4272,7 +4272,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * unmapped file backed pages.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+	    pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 98f43725d910..d04f53997fd2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -344,6 +344,8 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
+	if (vmstat_item_in_bytes(item))
+		t <<= PAGE_SHIFT;
 
 	if (unlikely(x > t || x < -t)) {
 		node_page_state_add(x, pgdat, item);
@@ -555,6 +557,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 		 * for all cpus in a node.
 		 */
 		t = this_cpu_read(pcp->stat_threshold);
+		if (vmstat_item_in_bytes(item))
+			t <<= PAGE_SHIFT;
 
 		o = this_cpu_read(*p);
 		n = delta + o;
@@ -999,6 +1003,12 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 #endif
 	return x;
 }
+
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+				    enum node_stat_item item)
+{
+	return node_page_state(pgdat, item) >> PAGE_SHIFT;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1547,10 +1557,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	if (is_zone_first_populated(pgdat, zone)) {
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			unsigned long x = node_page_state(pgdat, i);
+
+			if (vmstat_item_in_bytes(i))
+				x >>= PAGE_SHIFT;
 			seq_printf(m, "\n      %-12s %lu",
 				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-				NR_VM_NUMA_STAT_ITEMS],
-				node_page_state(pgdat, i));
+				NR_VM_NUMA_STAT_ITEMS], x);
 		}
 	}
 	seq_printf(m,
@@ -1679,8 +1692,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	v += NR_VM_NUMA_STAT_ITEMS;
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 		v[i] = global_node_page_state(i);
+		if (vmstat_item_in_bytes(i))
+			v[i] >>= PAGE_SHIFT;
+	}
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
diff --git a/mm/workingset.c b/mm/workingset.c
index c963831d354f..675792387e58 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -430,8 +430,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.21.0



* [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (3 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 06/14] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Allocate and release memory for storing the memcg ownership data.
For each slab page, allocate space sufficient for number_of_objects
pointers to struct mem_cgroup_ptr.

The mem_cgroup field of struct page isn't used for slab pages, so
let's use the space for storing the pointer to the allocated vector.

This commit makes sure that the space is ready for use, but nobody is
actually using it yet. Following commits in the series will start
using it.
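
The size of the vector simply follows the number of objects on the slab
page (a sketch of the sizing done by memcg_alloc_page_memcg_vec() in
this patch):

    /* one mem_cgroup_ptr pointer slot per object on the slab page */
    size_t vec_size = sizeof(struct mem_cgroup_ptr *) * objects;

    page->mem_cgroup_vec = kmalloc(vec_size, gfp | __GFP_ZERO);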

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mm_types.h |  5 ++++-
 mm/slab.c                |  3 ++-
 mm/slab.h                | 37 ++++++++++++++++++++++++++++++++++++-
 mm/slub.c                |  2 +-
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 25395481d2ae..510cb170c4b8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -199,7 +199,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct mem_cgroup_ptr **mem_cgroup_vec;
+	};
 #endif
 
 	/*
diff --git a/mm/slab.c b/mm/slab.c
index 9df370558e5d..f0833f287dcf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1369,7 +1369,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+			     cachep->num)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 7c5577c2b9ea..16d7ea30a2d3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -406,6 +406,23 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
 
+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+					     unsigned int objects)
+{
+	page->mem_cgroup_vec = kmalloc(sizeof(struct mem_cgroup_ptr *) *
+				       objects, gfp | __GFP_ZERO);
+	if (!page->mem_cgroup_vec)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+	kfree(page->mem_cgroup_vec);
+	page->mem_cgroup_vec = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +472,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_memcg_vec(struct page *page, gfp_t gfp,
+					     unsigned int objects)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_memcg_vec(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -479,14 +506,21 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 
 static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
+					    struct kmem_cache *s,
+					    unsigned int objects)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_memcg_vec(page, gfp, objects);
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +533,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_memcg_vec(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 0873b77727bf..3014158c100d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1517,7 +1517,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
+	if (page && charge_slab_page(page, flags, order, s, oo_objects(oo))) {
 		__free_pages(page, order);
 		page = NULL;
 	}
-- 
2.21.0



* [PATCH RFC 06/14] mm: slub: implement SLUB version of obj_to_index()
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (4 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 07/14] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

This commit implements a SLUB version of the obj_to_index() function,
which will be required to calculate the offset of the memcg_ptr in the
mem_cgroup_vec in order to store/obtain the memcg ownership data.

To make it faster, let's repeat SLAB's trick introduced by commit
6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a divide in
obj_to_index()") and avoid an expensive division.
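
The index is logically (obj - page_address(page)) / s->size; the
precomputed reciprocal value lets it be computed with a multiplication
and a shift. A worked example with made-up numbers:

    /* assume s->size == 192 and the object starts 576 bytes into the page */
    unsigned int idx = reciprocal_divide(576, s->reciprocal_size);
    /* idx == 3, the same as 576 / 192, but without a hardware division */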

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slub_def.h | 9 +++++++++
 mm/slub.c                | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..200ea292f250 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,11 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - page_address(page),
+				 cache->reciprocal_size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index 3014158c100d..b043cfb673c9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3587,6 +3587,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-- 
2.21.0



* [PATCH RFC 07/14] mm: memcg/slab: save memcg ownership data for non-root slab objects
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (5 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 06/14] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Store a memcg_ptr in the corresponding slot of the mem_cgroup_vec
for each allocated non-root slab object. Make sure that each allocated
object holds a reference to the mem_cgroup_ptr.

To get the memcg_ptr in the post-alloc hook, we need a memcg pointer.
Because all memory cgroups will soon share the same set of kmem_caches,
let's not use kmem_cache->memcg_params.memcg. Instead, let's pass
the pointer directly from memcg_kmem_get_cache(). This guarantees
that the same cgroup is used in the pre- and post-alloc hooks.

Please note that the code is a bit bulky now, because we have to
manage three types of reference-counted objects: memcg, kmem_cache
and memcg_ptr. The following commits in the series will simplify it.
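
To summarize the flow (a simplified sketch with the allocator internals
omitted; see the actual hunks below):

	struct mem_cgroup *memcg = NULL;

	/* pre-alloc: pick the memcg kmem_cache and grab a memcg reference */
	s = slab_pre_alloc_hook(s, &memcg, size, gfpflags);

	/* ... allocate the objects into p[0..size-1] ... */

	/*
	 * post-alloc: for each object store a mem_cgroup_ptr (one reference
	 * per object) into page->mem_cgroup_vec[obj_to_index(s, page, obj)],
	 * then drop the memcg and kmem_cache references taken above.
	 */
	slab_post_alloc_hook(s, memcg, gfpflags, size, p);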

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            |  8 +++--
 mm/slab.c                  | 18 ++++++-----
 mm/slab.h                  | 64 ++++++++++++++++++++++++++++++++++----
 mm/slub.c                  | 14 ++++++---
 5 files changed, 86 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dd5ebfe5a86c..b9643d758fc9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1395,7 +1395,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct mem_cgroup **memcgp);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4af9810b59e..761b646eb968 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3015,7 +3015,8 @@ static inline bool memcg_kmem_bypass(void)
  * done with it, memcg_kmem_put_cache() must be called to release the
  * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -3071,8 +3072,11 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 */
 	if (unlikely(!memcg_cachep))
 		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
+		css_get(&memcg->css);
+		*memcgp = memcg;
 		cachep = memcg_cachep;
+	}
 out_unlock:
 	rcu_read_unlock();
 	return cachep;
diff --git a/mm/slab.c b/mm/slab.c
index f0833f287dcf..956aab3b9035 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3222,9 +3222,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
+	struct mem_cgroup *memcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3260,7 +3261,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, memcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3301,9 +3302,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
+	struct mem_cgroup *memcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &memcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3317,7 +3319,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, memcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3439,6 +3441,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
+	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
@@ -3504,8 +3507,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
+	struct mem_cgroup *memcg = NULL;
 
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &memcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3528,13 +3532,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, memcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, memcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index 16d7ea30a2d3..39d93c970839 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -423,6 +423,45 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 	page->mem_cgroup_vec = NULL;
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct mem_cgroup *memcg,
+					      size_t size, void **p)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+	struct page *page;
+	unsigned long off;
+	size_t i;
+
+	memcg_ptr = mem_cgroup_get_kmem_ptr(memcg);
+	for (i = 0; i < size; i++) {
+		if (likely(p[i])) {
+			page = virt_to_head_page(p[i]);
+			off = obj_to_index(s, page, p[i]);
+			mem_cgroup_ptr_get(memcg_ptr);
+			page->mem_cgroup_vec[off] = memcg_ptr;
+		}
+	}
+	mem_cgroup_ptr_put(memcg_ptr);
+	mem_cgroup_put(memcg);
+
+	memcg_kmem_put_cache(s);
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+	unsigned int off;
+
+	if (!memcg_kmem_enabled() || is_root_cache(s))
+		return;
+
+	off = obj_to_index(s, page, p);
+	memcg_ptr = page->mem_cgroup_vec[off];
+	page->mem_cgroup_vec[off] = NULL;
+	mem_cgroup_ptr_put(memcg_ptr);
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -482,6 +521,17 @@ static inline void memcg_free_page_memcg_vec(struct page *page)
 {
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct mem_cgroup *memcg,
+					      size_t size, void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -591,7 +641,8 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     gfp_t flags)
+						     struct mem_cgroup **memcgp,
+						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
 
@@ -605,13 +656,14 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s);
+		return memcg_kmem_get_cache(s, memcgp);
 
 	return s;
 }
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-					size_t size, void **p)
+static inline void slab_post_alloc_hook(struct kmem_cache *s,
+					struct mem_cgroup *memcg,
+					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
 
@@ -623,8 +675,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_kmem_put_cache(s);
+	if (!is_root_cache(s))
+		memcg_slab_post_alloc_hook(s, memcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index b043cfb673c9..e8c44144a85f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2686,8 +2686,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
+	struct mem_cgroup *memcg = NULL;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
+	s = slab_pre_alloc_hook(s, &memcg, 1, gfpflags);
 	if (!s)
 		return NULL;
 redo:
@@ -2767,7 +2768,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
-	slab_post_alloc_hook(s, gfpflags, 1, &object);
+	slab_post_alloc_hook(s, memcg, gfpflags, 1, &object);
 
 	return object;
 }
@@ -2972,6 +2973,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
+
+	memcg_slab_free_hook(s, page, head);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3149,9 +3152,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
+	struct mem_cgroup *memcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &memcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3193,11 +3197,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, memcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, memcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (6 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 07/14] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 21:45 ` [PATCH RFC 09/14] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function is small
and is a natural fit for a static inline helper.

It will be used from the slab code.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 7 +++++++
 mm/memcontrol.c            | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b9643d758fc9..8f1d7161579f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1430,6 +1430,13 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+	return false;
+}
+
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 {
 	if (memcg_kmem_enabled())
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 761b646eb968..d57f95177aec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2992,13 +2992,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	queue_work(memcg_kmem_cache_wq, &cw->work);
 }
 
-static inline bool memcg_kmem_bypass(void)
-{
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
-		return true;
-	return false;
-}
-
 /**
  * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH RFC 09/14] mm: memcg: introduce __mod_lruvec_memcg_state()
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (7 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2019-09-05 21:45 ` Roman Gushchin
  2019-09-05 22:37 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 21:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

To prepare for per-object accounting of slab objects, let's introduce
the __mod_lruvec_memcg_state() and mod_lruvec_memcg_state() helpers,
which are similar to mod_lruvec_state(), but do not update the global
node counters, only the per-lruvec and per-cgroup ones.

This is necessary because the node slab counters will soon account
all memory used by slab pages, whereas on the memcg level only the
memory actually occupied by objects will be counted. Free space on
slab pages is shared between all cgroups, so it can't be accounted
to any specific one.
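
For illustration, a hypothetical caller could look like this
(charge_slab_obj_bytes() is a made-up name; NR_SLAB_RECLAIMABLE_B comes
from the earlier vmstat conversion in this series):

	static void charge_slab_obj_bytes(struct lruvec *lruvec, int obj_size)
	{
		/*
		 * Only the memcg and lruvec counters are updated here, in
		 * bytes; the node-level counter keeps tracking whole slab
		 * pages and is updated when the slab page itself is
		 * allocated or freed.
		 */
		mod_lruvec_memcg_state(lruvec, NR_SLAB_RECLAIMABLE_B, obj_size);
	}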

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 22 ++++++++++++++++++++++
 mm/memcontrol.c            | 37 +++++++++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f1d7161579f..cef8a9c51482 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -739,6 +739,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -751,6 +753,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	local_irq_restore(flags);
 }
 
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_lruvec_memcg_state(lruvec, idx, val);
+	local_irq_restore(flags);
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
@@ -1143,6 +1155,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 }
 
+static inline void __mod_lruvec_memcg_state(struct lruvec *lruvec,
+					    enum node_stat_item idx, int val)
+{
+}
+
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d57f95177aec..89a892ef7699 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -795,16 +795,16 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 }
 
 /**
- * __mod_lruvec_state - update lruvec memory statistics
+ * __mod_lruvec_memcg_state - update lruvec memory statistics
  * @lruvec: the lruvec
  * @idx: the stat item
  * @val: delta to add to the counter, can be negative
  *
  * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
+ * function updates the two of three counters that are affected by a
+ * change of state at this level: per-cgroup and per-lruvec.
  */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val)
 {
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
@@ -812,12 +812,6 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	struct mem_cgroup *memcg;
 	long x, threshold = MEMCG_CHARGE_BATCH;
 
-	/* Update node */
-	__mod_node_page_state(pgdat, idx, val);
-
-	if (mem_cgroup_disabled())
-		return;
-
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
@@ -841,6 +835,29 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			int val)
+{
+	pg_data_t *pgdat = lruvec_pgdat(lruvec);
+
+	/* Update node */
+	__mod_node_page_state(pgdat, idx, val);
+
+	/* Update per-cgroup and per-lruvec stats */
+	if (!mem_cgroup_disabled())
+		__mod_lruvec_memcg_state(lruvec, idx, val);
+}
+
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
 	struct page *page = virt_to_head_page(p);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr
  2019-09-05 21:45 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
@ 2019-09-05 22:34   ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 22:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, Kernel Team,
	Shakeel Butt, Vladimir Davydov, Waiman Long

On Thu, Sep 05, 2019 at 02:45:46PM -0700, Roman Gushchin wrote:
> This commit introduces the mem_cgroup_ptr structure and the corresponding API.
> It implements a pointer to a memory cgroup with a built-in reference
> counter. Its main goal is to implement reparenting efficiently.
> 
> If a number of objects (e.g. slab pages) have to keep a pointer and
> a reference to a memory cgroup, they can use mem_cgroup_ptr instead.
> On reparenting, only one mem_cgroup_ptr->memcg pointer has to be
> changed, instead of walking over all accounted objects.
> 
> mem_cgroup_ptr holds a single reference to the corresponding memory
> cgroup. Because it's initialized before the css reference counter,
> css's refcounter can't be bumped at allocation time. Instead, it's
> bumped on reparenting which happens during offlining. A cgroup is
> never released online, so it's fine.
> 
> mem_cgroup_ptr is released using RCU, so memcg->kmem_memcg_ptr can
> be accessed in an RCU read section. On reparenting it's atomically
> switched to NULL. If the reader gets NULL, it can just read parent's
> kmem_memcg_ptr instead.
> 
> Each memory cgroup contains a list of kmem_memcg_ptrs. On reparenting
> the list is spliced into the parent's list. The list is protected
> using the css set lock.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h | 50 ++++++++++++++++++++++
>  mm/memcontrol.c            | 87 ++++++++++++++++++++++++++++++++++++--
>  2 files changed, 133 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 120d39066148..dd5ebfe5a86c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>  #include <linux/page-flags.h>
>  
>  struct mem_cgroup;
> +struct mem_cgroup_ptr;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> @@ -197,6 +198,22 @@ struct memcg_cgwb_frn {
>  	int memcg_id;			/* memcg->css.id of foreign inode */
>  	u64 at;				/* jiffies_64 at the time of dirtying */
>  	struct wb_completion done;	/* tracks in-flight foreign writebacks */
> +}

Oops, a semicolon has been lost during the final rebase.
I'll send a correct version of this patch separately.

Sorry for the mess.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (8 preceding siblings ...)
  2019-09-05 21:45 ` [PATCH RFC 09/14] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
@ 2019-09-05 22:37 ` Roman Gushchin
  2019-09-17 19:48 ` [PATCH RFC 00/14] The new slab memory controller Waiman Long
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-05 22:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

This commit introduces the mem_cgroup_ptr structure and the corresponding API.
It implements a pointer to a memory cgroup with a built-in reference
counter. Its main goal is to implement reparenting efficiently.

If a number of objects (e.g. slab pages) have to keep a pointer and
a reference to a memory cgroup, they can use mem_cgroup_ptr instead.
On reparenting, only one mem_cgroup_ptr->memcg pointer has to be
changed, instead of walking over all accounted objects.

mem_cgroup_ptr holds a single reference to the corresponding memory
cgroup. Because it's initialized before the css reference counter,
css's refcounter can't be bumped at allocation time. Instead, it's
bumped on reparenting which happens during offlining. A cgroup is
never released online, so it's fine.

mem_cgroup_ptr is released using RCU, so memcg->kmem_memcg_ptr can
be accessed in an RCU read section. On reparenting it's atomically
switched to NULL. If the reader gets NULL, it can just read parent's
kmem_memcg_ptr instead.

Each memory cgroup contains a list of kmem_memcg_ptrs. On reparenting
the list is spliced into the parent's list. The list is protected
using the css set lock.
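
The intended usage from the object side looks roughly like this (an
illustrative sketch, not part of the diff below; obj->memcg_ptr is a
hypothetical per-object field):

	/* charge path: resolve the current kmem_memcg_ptr, taking a reference */
	obj->memcg_ptr = mem_cgroup_get_kmem_ptr(memcg);

	/*
	 * uncharge path: drop the reference; the memcg pointer inside
	 * the mem_cgroup_ptr may have been switched to a parent cgroup
	 * in the meantime.
	 */
	mem_cgroup_ptr_put(obj->memcg_ptr);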

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 50 ++++++++++++++++++++++
 mm/memcontrol.c            | 87 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 120d39066148..d822ea66278c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct mem_cgroup_ptr;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -199,6 +200,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * A pointer to a memory cgroup with a built-in reference counter.
+ * For a use as an intermediate object to simplify reparenting of
+ * objects charged to the cgroup. The memcg pointer can be switched
+ * to the parent cgroup without a need to modify all objects
+ * which hold the reference to the cgroup.
+ */
+struct mem_cgroup_ptr {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -312,6 +329,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct mem_cgroup_ptr __rcu *kmem_memcg_ptr;
+	struct list_head kmem_memcg_ptr_list;
 #endif
 
 	int last_scanned_node;
@@ -440,6 +459,21 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool mem_cgroup_ptr_tryget(struct mem_cgroup_ptr *ptr)
+{
+	return percpu_ref_tryget(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_get(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_get(&ptr->refcnt);
+}
+
+static inline void mem_cgroup_ptr_put(struct mem_cgroup_ptr *ptr)
+{
+	percpu_ref_put(&ptr->refcnt);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1433,6 +1467,22 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+static inline struct mem_cgroup_ptr *
+mem_cgroup_get_kmem_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *memcg_ptr;
+
+	rcu_read_lock();
+	do {
+		memcg_ptr = rcu_dereference(memcg->kmem_memcg_ptr);
+		if (memcg_ptr && mem_cgroup_ptr_tryget(memcg_ptr))
+			break;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+
+	return memcg_ptr;
+}
+
 #else
 
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index effefcec47b3..cb9adb31360e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -266,6 +266,77 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void memcg_ptr_release(struct percpu_ref *ref)
+{
+	struct mem_cgroup_ptr *ptr = container_of(ref, struct mem_cgroup_ptr,
+						  refcnt);
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	list_del(&ptr->list);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	mem_cgroup_put(ptr->memcg);
+	percpu_ref_exit(ref);
+	kfree_rcu(ptr, rcu);
+}
+
+static int memcg_init_kmem_memcg_ptr(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_ptr *kmem_memcg_ptr;
+	int ret;
+
+	kmem_memcg_ptr = kmalloc(sizeof(struct mem_cgroup_ptr), GFP_KERNEL);
+	if (!kmem_memcg_ptr)
+		return -ENOMEM;
+
+	ret = percpu_ref_init(&kmem_memcg_ptr->refcnt, memcg_ptr_release,
+			      0, GFP_KERNEL);
+	if (ret) {
+		kfree(kmem_memcg_ptr);
+		return ret;
+	}
+
+	kmem_memcg_ptr->memcg = memcg;
+	INIT_LIST_HEAD(&kmem_memcg_ptr->list);
+	rcu_assign_pointer(memcg->kmem_memcg_ptr, kmem_memcg_ptr);
+	list_add(&kmem_memcg_ptr->list, &memcg->kmem_memcg_ptr_list);
+	return 0;
+}
+
+static void memcg_reparent_kmem_memcg_ptr(struct mem_cgroup *memcg,
+					  struct mem_cgroup *parent)
+{
+	unsigned int nr_reparented = 0;
+	struct mem_cgroup_ptr *memcg_ptr = NULL;
+
+	rcu_swap_protected(memcg->kmem_memcg_ptr, memcg_ptr, true);
+	percpu_ref_kill(&memcg_ptr->refcnt);
+
+	/*
+	 * kmem_memcg_ptr is initialized before css refcounter, so until now
+	 * it doesn't hold a reference to the memcg. Bump it here.
+	 */
+	css_get(&memcg->css);
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(memcg_ptr, &memcg->kmem_memcg_ptr_list, list) {
+		xchg(&memcg_ptr->memcg, parent);
+		nr_reparented++;
+	}
+	if (nr_reparented)
+		list_splice(&memcg->kmem_memcg_ptr_list,
+			    &parent->kmem_memcg_ptr_list);
+	spin_unlock_irq(&css_set_lock);
+
+	if (nr_reparented) {
+		css_get_many(&parent->css, nr_reparented);
+		css_put_many(&memcg->css, nr_reparented);
+	}
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -3554,7 +3625,7 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	int memcg_id;
+	int memcg_id, ret;
 
 	if (cgroup_memory_nokmem)
 		return 0;
@@ -3566,6 +3637,12 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	ret = memcg_init_kmem_memcg_ptr(memcg);
+	if (ret) {
+		memcg_free_cache_id(memcg_id);
+		return ret;
+	}
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3601,12 +3678,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches. Then flush percpu
-	 * slab statistics to have precise values at the parent and
-	 * all ancestor levels. It's required to keep slab stats
+	 * Deactivate and reparent kmem_caches and reparent kmem_memcg_ptr.
+	 * Then flush percpu slab statistics to have precise values at the
+	 * parent and all ancestor levels. It's required to keep slab stats
 	 * accurate after the reparenting of kmem_caches.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_kmem_memcg_ptr(memcg, parent);
 	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -5171,6 +5249,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->kmem_memcg_ptr_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes
  2019-09-05 21:45 ` [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
@ 2019-09-16 12:38   ` Johannes Weiner
  2019-09-17  2:08     ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2019-09-16 12:38 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, kernel-team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Thu, Sep 05, 2019 at 02:45:48PM -0700, Roman Gushchin wrote:
> In order to prepare for per-object slab memory accounting,
> convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
> items to bytes.
> 
> To make sure that these vmstats are in bytes, rename them
> to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
> NR_KERNEL_STACK_KB).
> 
> The size of slab memory shouldn't exceed 4Gb on 32-bit machines,
> so it will fit into atomic_long_t we use for vmstats.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Maybe a crazy idea, but instead of mixing bytes and pages, would it be
difficult to account all vmstat items in bytes internally? And provide
two general APIs, byte and page based, to update and query the counts,
instead of tying the unit to individual items?

The vmstat_item_in_bytes() conditional shifting is pretty awkward in
code that has a recent history littered with subtle breakages.

The translation helper node_page_state_pages() will yield garbage if
used with the page-based counters, which is another easy to misuse
interface.

We already have many places that multiply with PAGE_SIZE to get the
stats in bytes or kb units.

And _B/_KB suffixes are kinda clunky.

The stats use atomic_long_t, so switching to atomic64_t doesn't make a
difference on 64-bit and is backward compatible with 32-bit.

The per-cpu batch size you have to raise from s8 either way.

It seems to me that would make the code and API a lot simpler and
easier to use / harder to misuse.
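
E.g. something like this, just to sketch the shape of such an API
(made-up names):

	/* internal representation is bytes */
	void mod_node_state_bytes(struct pglist_data *pgdat,
				  enum node_stat_item idx, long bytes);

	/* page-based convenience wrapper on top */
	static inline void mod_node_state_pages(struct pglist_data *pgdat,
						enum node_stat_item idx,
						long pages)
	{
		mod_node_state_bytes(pgdat, idx, pages << PAGE_SHIFT);
	}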

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 01/14] mm: memcg: subpage charging API
  2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
@ 2019-09-16 12:56   ` Johannes Weiner
  2019-09-17  2:27     ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2019-09-16 12:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, kernel-team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> Introduce an API to charge subpage objects to the memory cgroup.
> The API will be used by the new slab memory controller. Later it
> can also be used to implement percpu memory accounting.
> In both cases, a single page can be shared between multiple cgroups
> (and in percpu case a single allocation is split over multiple pages),
> so it's not possible to use page-based accounting.
> 
> The implementation is based on percpu stocks. Memory cgroups are still
> charged in pages, and the residue is stored in perpcu stock, or on the
> memcg itself, when it's necessary to flush the stock.

Did you just implement a slab allocator for page_counter to track
memory consumed by the slab allocator?

> @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
>  }
>  
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -		      unsigned int nr_pages)
> +		      unsigned int amount, bool subpage)
>  {
> +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
>  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
> @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_is_root(memcg))
>  		return 0;
>  retry:
> -	if (consume_stock(memcg, nr_pages))
> +	if (subpage && consume_subpage_stock(memcg, amount))
> +		return 0;
> +	else if (!subpage && consume_stock(memcg, nr_pages))
>  		return 0;

The layering here isn't clean. We have an existing per-cpu cache to
batch-charge the page counter. Why does the new subpage allocator not
sit on *top* of this, instead of wedged in between?

I think what it should be is a try_charge_bytes() that simply gets one
page from try_charge() and then does its byte tracking, regardless of
how try_charge() chooses to implement its own page tracking.

That would avoid the awkward @amount + @subpage multiplexing, as well
as annotating all existing callsites of try_charge() with a
non-descript "false" parameter.

You can still reuse the stock data structures, use the lower bits of
stock->nr_bytes for a different cgroup etc., but the charge API should
really be separate.
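
Something along these lines, just to illustrate the layering (the byte
stock helpers below are made-up names):

	static int try_charge_bytes(struct mem_cgroup *memcg, gfp_t gfp,
				    unsigned int nr_bytes)
	{
		int ret;

		/* per-cpu byte cache, analogous to the existing page stock */
		if (consume_byte_stock(memcg, nr_bytes))
			return 0;

		/* fall back to the page-based charge path for one full page */
		ret = try_charge(memcg, gfp, 1);
		if (ret)
			return ret;

		/* keep the unused remainder of the page in the byte cache */
		refill_byte_stock(memcg, PAGE_SIZE - nr_bytes);
		return 0;
	}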

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes
  2019-09-16 12:38   ` Johannes Weiner
@ 2019-09-17  2:08     ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-17  2:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Mon, Sep 16, 2019 at 02:38:40PM +0200, Johannes Weiner wrote:
> On Thu, Sep 05, 2019 at 02:45:48PM -0700, Roman Gushchin wrote:
> > In order to prepare for per-object slab memory accounting,
> > convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
> > items to bytes.
> > 
> > To make sure that these vmstats are in bytes, rename them
> > to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
> > NR_KERNEL_STACK_KB).
> > 
> > The size of slab memory shouldn't exceed 4Gb on 32-bit machines,
> > so it will fit into atomic_long_t we use for vmstats.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>

Hello Johannes!

Thank you for looking into the patchset!

> 
> Maybe a crazy idea, but instead of mixing bytes and pages, would it be
> difficult to account all vmstat items in bytes internally? And provide
> two general APIs, byte and page based, to update and query the counts,
> instead of tying the unit to individual items?
> 
> The vmstat_item_in_bytes() conditional shifting is pretty awkward in
> code that has a recent history littered with subtle breakages.
> 
> The translation helper node_page_state_pages() will yield garbage if
> used with the page-based counters, which is another easy to misuse
> interface.
> 
> We already have many places that multiply with PAGE_SIZE to get the
> stats in bytes or kb units.
> 
> And _B/_KB suffixes are kinda clunky.
> 
> The stats use atomic_long_t, so switching to atomic64_t doesn't make a
> difference on 64-bit and is backward compatible with 32-bit.

I fully agree here, that having different stats in different units
adds a lot of mess to the code. But I always thought that 64-bit
atomics are slow on a 32-bit machine, so it might be a noticeable
performance regression. Don't you think so?

I'm happy to prepare such a patch(set), only I'd prefer to keep it
separately from this one. It can precede or follow the slab controller
rework, either way will work. Slab controller rework is already not so
small, so adding more code (and potential issues) here will only make
the review more complex.

> 
> The per-cpu batch size you have to raise from s8 either way.

Yeah, tbh I don't know why those are not just unsigned long by default.
The space savings are negligible here, and I don't see any other reason.
It could even be slightly faster to use a larger type.

I kinda tried to keep the patchset as small as possible (at least for
the RFC version), so I tried to avoid any unnecessary changes.
But overall using s8 or s16 here doesn't make much sense to me.

> 
> It seems to me that would make the code and API a lot simpler and
> easier to use / harder to misuse.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 01/14] mm: memcg: subpage charging API
  2019-09-16 12:56   ` Johannes Weiner
@ 2019-09-17  2:27     ` Roman Gushchin
  2019-09-17  8:50       ` Johannes Weiner
  0 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-09-17  2:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > Introduce an API to charge subpage objects to the memory cgroup.
> > The API will be used by the new slab memory controller. Later it
> > can also be used to implement percpu memory accounting.
> > In both cases, a single page can be shared between multiple cgroups
> > (and in percpu case a single allocation is split over multiple pages),
> > so it's not possible to use page-based accounting.
> > 
> > The implementation is based on percpu stocks. Memory cgroups are still
> > charged in pages, and the residue is stored in perpcu stock, or on the
> > memcg itself, when it's necessary to flush the stock.
> 
> Did you just implement a slab allocator for page_counter to track
> memory consumed by the slab allocator?

:)

> 
> > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> >  }
> >  
> >  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > -		      unsigned int nr_pages)
> > +		      unsigned int amount, bool subpage)
> >  {
> > +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> >  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem_over_limit;
> > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_is_root(memcg))
> >  		return 0;
> >  retry:
> > -	if (consume_stock(memcg, nr_pages))
> > +	if (subpage && consume_subpage_stock(memcg, amount))
> > +		return 0;
> > +	else if (!subpage && consume_stock(memcg, nr_pages))
> >  		return 0;
> 
> The layering here isn't clean. We have an existing per-cpu cache to
> batch-charge the page counter. Why does the new subpage allocator not
> sit on *top* of this, instead of wedged in between?
> 
> I think what it should be is a try_charge_bytes() that simply gets one
> page from try_charge() and then does its byte tracking, regardless of
> how try_charge() chooses to implement its own page tracking.
> 
> That would avoid the awkward @amount + @subpage multiplexing, as well
> as annotating all existing callsites of try_charge() with a
> non-descript "false" parameter.
> 
> You can still reuse the stock data structures, use the lower bits of
> stock->nr_bytes for a different cgroup etc., but the charge API should
> really be separate.

Hm, I kinda like the idea, however there is a complication: for the subpage
accounting the css reference management is done in a different way, so that
all existing code should avoid changing the css refcounter. So I'd need
to pass a boolean argument anyway.

But let me try to write this down, hopefully v2 will be cleaner.

Thank you!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 01/14] mm: memcg: subpage charging API
  2019-09-17  2:27     ` Roman Gushchin
@ 2019-09-17  8:50       ` Johannes Weiner
  2019-09-17 18:33         ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2019-09-17  8:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Tue, Sep 17, 2019 at 02:27:19AM +0000, Roman Gushchin wrote:
> On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> > On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > > Introduce an API to charge subpage objects to the memory cgroup.
> > > The API will be used by the new slab memory controller. Later it
> > > can also be used to implement percpu memory accounting.
> > > In both cases, a single page can be shared between multiple cgroups
> > > (and in percpu case a single allocation is split over multiple pages),
> > > so it's not possible to use page-based accounting.
> > > 
> > > The implementation is based on percpu stocks. Memory cgroups are still
> > > charged in pages, and the residue is stored in perpcu stock, or on the
> > > memcg itself, when it's necessary to flush the stock.
> > 
> > Did you just implement a slab allocator for page_counter to track
> > memory consumed by the slab allocator?
> 
> :)
> 
> > 
> > > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> > >  }
> > >  
> > >  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > -		      unsigned int nr_pages)
> > > +		      unsigned int amount, bool subpage)
> > >  {
> > > +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> > >  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> > >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > >  	struct mem_cgroup *mem_over_limit;
> > > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  	if (mem_cgroup_is_root(memcg))
> > >  		return 0;
> > >  retry:
> > > -	if (consume_stock(memcg, nr_pages))
> > > +	if (subpage && consume_subpage_stock(memcg, amount))
> > > +		return 0;
> > > +	else if (!subpage && consume_stock(memcg, nr_pages))
> > >  		return 0;
> > 
> > The layering here isn't clean. We have an existing per-cpu cache to
> > batch-charge the page counter. Why does the new subpage allocator not
> > sit on *top* of this, instead of wedged in between?
> > 
> > I think what it should be is a try_charge_bytes() that simply gets one
> > page from try_charge() and then does its byte tracking, regardless of
> > how try_charge() chooses to implement its own page tracking.
> > 
> > That would avoid the awkward @amount + @subpage multiplexing, as well
> > as annotating all existing callsites of try_charge() with a
> > non-descript "false" parameter.
> > 
> > You can still reuse the stock data structures, use the lower bits of
> > stock->nr_bytes for a different cgroup etc., but the charge API should
> > really be separate.
> 
> Hm, I kinda like the idea, however there is a complication: for the subpage
> accounting the css reference management is done in a different way, so that
> all existing code should avoid changing the css refcounter. So I'd need
> to pass a boolean argument anyway.

Can you elaborate on the refcounting scheme? I don't quite understand
how there would be complications with that.

Generally, references are held for each page that is allocated in the
page_counter. try_charge() allocates a batch of css references,
returns one and keeps the rest in stock.

So couldn't the following work?

When somebody allocates a subpage, the css reference returned by
try_charge() is shared by the allocated subpage object and the
remainder that is kept via stock->subpage_cache and stock->nr_bytes
(or memcg->nr_stocked_bytes when the percpu cache is reset).

When the subpage objects are freed, you'll eventually have a full page
again in stock->nr_bytes, at which point you page_counter_uncharge()
paired with css_put(_many) as per usual.

A remainder left in old->nr_stocked_bytes would continue to hold on to
one css reference. (I don't quite understand who is protecting this
remainder in your current version, actually. A bug?)

Instead of doing your own batched page_counter uncharging in
refill_subpage_stock() -> drain_subpage_stock(), you should be able to
call refill_stock() when stock->nr_bytes adds up to a whole page again.

Again, IMO this would be much cleaner architecture if there was a
try_charge_bytes() byte allocator that would sit on top of a cleanly
abstracted try_charge() page allocator, just like the slab allocator
is sitting on top of the page allocator - instead of breaking through
the abstraction layer of the underlying page allocator.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 01/14] mm: memcg: subpage charging API
  2019-09-17  8:50       ` Johannes Weiner
@ 2019-09-17 18:33         ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-17 18:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Tue, Sep 17, 2019 at 10:50:04AM +0200, Johannes Weiner wrote:
> On Tue, Sep 17, 2019 at 02:27:19AM +0000, Roman Gushchin wrote:
> > On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> > > On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > > > Introduce an API to charge subpage objects to the memory cgroup.
> > > > The API will be used by the new slab memory controller. Later it
> > > > can also be used to implement percpu memory accounting.
> > > > In both cases, a single page can be shared between multiple cgroups
> > > > (and in percpu case a single allocation is split over multiple pages),
> > > > so it's not possible to use page-based accounting.
> > > > 
> > > > The implementation is based on percpu stocks. Memory cgroups are still
> > > > charged in pages, and the residue is stored in perpcu stock, or on the
> > > > memcg itself, when it's necessary to flush the stock.
> > > 
> > > Did you just implement a slab allocator for page_counter to track
> > > memory consumed by the slab allocator?
> > 
> > :)
> > 
> > > 
> > > > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> > > >  }
> > > >  
> > > >  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > -		      unsigned int nr_pages)
> > > > +		      unsigned int amount, bool subpage)
> > > >  {
> > > > +	unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> > > >  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> > > >  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > >  	struct mem_cgroup *mem_over_limit;
> > > > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  	if (mem_cgroup_is_root(memcg))
> > > >  		return 0;
> > > >  retry:
> > > > -	if (consume_stock(memcg, nr_pages))
> > > > +	if (subpage && consume_subpage_stock(memcg, amount))
> > > > +		return 0;
> > > > +	else if (!subpage && consume_stock(memcg, nr_pages))
> > > >  		return 0;
> > > 
> > > The layering here isn't clean. We have an existing per-cpu cache to
> > > batch-charge the page counter. Why does the new subpage allocator not
> > > sit on *top* of this, instead of wedged in between?
> > > 
> > > I think what it should be is a try_charge_bytes() that simply gets one
> > > page from try_charge() and then does its byte tracking, regardless of
> > > how try_charge() chooses to implement its own page tracking.
> > > 
> > > That would avoid the awkward @amount + @subpage multiplexing, as well
> > > as annotating all existing callsites of try_charge() with a
> > > non-descript "false" parameter.
> > > 
> > > You can still reuse the stock data structures, use the lower bits of
> > > stock->nr_bytes for a different cgroup etc., but the charge API should
> > > really be separate.
> > 
> > Hm, I kinda like the idea, however there is a complication: for the subpage
> > accounting the css reference management is done in a different way, so that
> > all existing code should avoid changing the css refcounter. So I'd need
> > to pass a boolean argument anyway.
> 
> Can you elaborate on the refcounting scheme? I don't quite understand
> how there would be complications with that.
> 
> Generally, references are held for each page that is allocated in the
> page_counter. try_charge() allocates a batch of css references,
> returns one and keeps the rest in stock.
> 
> So couldn't the following work?
> 
> When somebody allocates a subpage, the css reference returned by
> try_charge() is shared by the allocated subpage object and the
> remainder that is kept via stock->subpage_cache and stock->nr_bytes
> (or memcg->nr_stocked_bytes when the percpu cache is reset).

Because individual objects are subject to reparenting and can outlive
the originating memory cgroup, they shouldn't hold a direct reference to the
memory cgroup. Instead they hold a reference to the mem_cgroup_ptr object,
and this object holds a single reference to the memory cgroup.
The underlying pages shouldn't hold a reference either.

Btw, it's already true today: kmem_cache plays the role of such an
intermediate object, and we do an explicit transfer of the charge
(look at memcg_charge_slab()). So we initially associate a page with the
memcg, and almost immediately afterwards break this association and insert
the kmem_cache in between.

But with subpage accounting that's not possible, as a page is shared between
multiple cgroups and can't be attributed to any specific cgroup at
any time.

> 
> When the subpage objects are freed, you'll eventually have a full page
> again in stock->nr_bytes, at which point you page_counter_uncharge()
> paired with css_put(_many) as per usual.
> 
> A remainder left in old->nr_stocked_bytes would continue to hold on to
> one css reference. (I don't quite understand who is protecting this
> remainder in your current version, actually. A bug?)
> 
> Instead of doing your own batched page_counter uncharging in
> refill_subpage_stock() -> drain_subpage_stock(), you should be able to
> call refill_stock() when stock->nr_bytes adds up to a whole page again.
> 
> Again, IMO this would be much cleaner architecture if there was a
> try_charge_bytes() byte allocator that would sit on top of a cleanly
> abstracted try_charge() page allocator, just like the slab allocator
> is sitting on top of the page allocator - instead of breaking through
> the abstraction layer of the underlying page allocator.
> 

As I said, I like the idea of putting it on top, but it can't be done
without changes in css refcounting (or I don't see how). I don't know
how to mix stocks which hold css references with those which don't,
so I might end up with two stocks, as in the current implementation. That
makes the idea of having another layer of caching on top slightly less
appealing, but it may still be worth a try.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (9 preceding siblings ...)
  2019-09-05 22:37 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
@ 2019-09-17 19:48 ` Waiman Long
  2019-09-17 21:24   ` Roman Gushchin
  2019-09-19 13:39 ` Suleiman Souhlal
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Waiman Long @ 2019-09-17 19:48 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov

On 9/5/19 5:45 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.
>
> Arguably it wasn't an issue back to days when the kmem controller was
> introduced and was an opt-in feature, which had to be turned on
> individually for each memory cgroup. But now it's turned on by default
> on both cgroup v1 and v2. And modern systemd-based systems tend to
> create a large number of cgroups.
>
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
>
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
>
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
>
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.
>
> The patchset contains a couple of semi-independent parts, which can find their
> usage outside of the slab memory controller too:
> 1) subpage charging API, which can be used in the future for accounting of
>    other non-page-sized objects, e.g. percpu allocations.
> 2) mem_cgroup_ptr API (refcounted pointers to a memcg, can be reused
>    for the efficient reparenting of other objects, e.g. pagecache.
>
> The patchset has been tested on a number of different workloads in our
> production. In all cases, it saved hefty amounts of memory:
> 1) web frontend, 650-700 Mb, ~42% of slab memory
> 2) database cache, 750-800 Mb, ~35% of slab memory
> 3) dns server, 700 Mb, ~36% of slab memory
>
> So far I haven't found any regression on all tested workloads, but
> potential CPU regression caused by more precise accounting is a concern.
>
> Obviously the amount of saved memory depend on the number of memory cgroups,
> uptime and specific workloads, but overall it feels like the new controller
> saves 30-40% of slab memory, sometimes more. Additionally, it should lead
> to a lower memory fragmentation, just because of a smaller number of
> non-movable pages and also because there is no more need to move all
> slab objects to a new set of pages when a workload is restarted in a new
> memory cgroup.
>
> * https://github.com/osandov/drgn
>
>
> Roman Gushchin (14):
>   mm: memcg: subpage charging API
>   mm: memcg: introduce mem_cgroup_ptr
>   mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
>   mm: vmstat: convert slab vmstat counter to bytes
>   mm: memcg/slab: allocate space for memcg ownership data for non-root
>     slabs
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcg/slab: save memcg ownership data for non-root slab objects
>   mm: memcg: move memcg_kmem_bypass() to memcontrol.h
>   mm: memcg: introduce __mod_lruvec_memcg_state()
>   mm: memcg/slab: charge individual slab objects instead of pages
>   mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
>   mm: memcg/slab: replace memcg_from_slab_page() with
>     memcg_from_slab_obj()
>   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
>   mm: slab: remove redundant check in memcg_accumulate_slabinfo()
>
>  drivers/base/node.c        |  11 +-
>  fs/proc/meminfo.c          |   4 +-
>  include/linux/memcontrol.h | 102 ++++++++-
>  include/linux/mm_types.h   |   5 +-
>  include/linux/mmzone.h     |  12 +-
>  include/linux/slab.h       |   3 +-
>  include/linux/slub_def.h   |   9 +
>  include/linux/vmstat.h     |   8 +
>  kernel/power/snapshot.c    |   2 +-
>  mm/list_lru.c              |  12 +-
>  mm/memcontrol.c            | 431 +++++++++++++++++++++--------------
>  mm/oom_kill.c              |   2 +-
>  mm/page_alloc.c            |   8 +-
>  mm/slab.c                  |  37 ++-
>  mm/slab.h                  | 300 +++++++++++++------------
>  mm/slab_common.c           | 449 ++++---------------------------------
>  mm/slob.c                  |  12 +-
>  mm/slub.c                  |  63 ++----
>  mm/vmscan.c                |   3 +-
>  mm/vmstat.c                |  38 +++-
>  mm/workingset.c            |   6 +-
>  21 files changed, 683 insertions(+), 834 deletions(-)
>
I can only see the first 9 patches. Patches 10-14 are not there.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-17 19:48 ` [PATCH RFC 00/14] The new slab memory controller Waiman Long
@ 2019-09-17 21:24   ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-17 21:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov

On Tue, Sep 17, 2019 at 03:48:57PM -0400, Waiman Long wrote:
> On 9/5/19 5:45 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> >
> > Arguably it wasn't an issue back to days when the kmem controller was
> > introduced and was an opt-in feature, which had to be turned on
> > individually for each memory cgroup. But now it's turned on by default
> > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > create a large number of cgroups.
> >
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> >
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> >
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> >
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
> >
> > The patchset contains a couple of semi-independent parts, which can find their
> > usage outside of the slab memory controller too:
> > 1) subpage charging API, which can be used in the future for accounting of
> >    other non-page-sized objects, e.g. percpu allocations.
> > 2) mem_cgroup_ptr API (refcounted pointers to a memcg, can be reused
> >    for the efficient reparenting of other objects, e.g. pagecache.
> >
> > The patchset has been tested on a number of different workloads in our
> > production. In all cases, it saved hefty amounts of memory:
> > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > 2) database cache, 750-800 Mb, ~35% of slab memory
> > 3) dns server, 700 Mb, ~36% of slab memory
> >
> > So far I haven't found any regression on all tested workloads, but
> > potential CPU regression caused by more precise accounting is a concern.
> >
> > Obviously the amount of saved memory depend on the number of memory cgroups,
> > uptime and specific workloads, but overall it feels like the new controller
> > saves 30-40% of slab memory, sometimes more. Additionally, it should lead
> > to a lower memory fragmentation, just because of a smaller number of
> > non-movable pages and also because there is no more need to move all
> > slab objects to a new set of pages when a workload is restarted in a new
> > memory cgroup.
> >
> > * https://github.com/osandov/drgn
> >
> >
> > Roman Gushchin (14):
> >   mm: memcg: subpage charging API
> >   mm: memcg: introduce mem_cgroup_ptr
> >   mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
> >   mm: vmstat: convert slab vmstat counter to bytes
> >   mm: memcg/slab: allocate space for memcg ownership data for non-root
> >     slabs
> >   mm: slub: implement SLUB version of obj_to_index()
> >   mm: memcg/slab: save memcg ownership data for non-root slab objects
> >   mm: memcg: move memcg_kmem_bypass() to memcontrol.h
> >   mm: memcg: introduce __mod_lruvec_memcg_state()
> >   mm: memcg/slab: charge individual slab objects instead of pages
> >   mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
> >   mm: memcg/slab: replace memcg_from_slab_page() with
> >     memcg_from_slab_obj()
> >   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> >   mm: slab: remove redundant check in memcg_accumulate_slabinfo()
> >
> >  drivers/base/node.c        |  11 +-
> >  fs/proc/meminfo.c          |   4 +-
> >  include/linux/memcontrol.h | 102 ++++++++-
> >  include/linux/mm_types.h   |   5 +-
> >  include/linux/mmzone.h     |  12 +-
> >  include/linux/slab.h       |   3 +-
> >  include/linux/slub_def.h   |   9 +
> >  include/linux/vmstat.h     |   8 +
> >  kernel/power/snapshot.c    |   2 +-
> >  mm/list_lru.c              |  12 +-
> >  mm/memcontrol.c            | 431 +++++++++++++++++++++--------------
> >  mm/oom_kill.c              |   2 +-
> >  mm/page_alloc.c            |   8 +-
> >  mm/slab.c                  |  37 ++-
> >  mm/slab.h                  | 300 +++++++++++++------------
> >  mm/slab_common.c           | 449 ++++---------------------------------
> >  mm/slob.c                  |  12 +-
> >  mm/slub.c                  |  63 ++----
> >  mm/vmscan.c                |   3 +-
> >  mm/vmstat.c                |  38 +++-
> >  mm/workingset.c            |   6 +-
> >  21 files changed, 683 insertions(+), 834 deletions(-)
> >
> I can only see the first 9 patches. Patches 10-14 are not there.

Hm, strange. I'll rebase the patchset on top of the current mm tree and resend.

In the meantime you can find the original patchset here:
  https://github.com/rgushchin/linux/tree/new_slab.rfc
or on top of the 5.3 release, which might be better for testing here:
  https://github.com/rgushchin/linux/tree/new_slab.rfc.v5.3

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (10 preceding siblings ...)
  2019-09-17 19:48 ` [PATCH RFC 00/14] The new slab memory controller Waiman Long
@ 2019-09-19 13:39 ` Suleiman Souhlal
  2019-09-19 16:22   ` Roman Gushchin
  2019-10-01 15:12 ` Michal Koutný
  2019-12-09  9:17 ` [PATCH 00/16] " Bharata B Rao
  13 siblings, 1 reply; 44+ messages in thread
From: Suleiman Souhlal @ 2019-09-19 13:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Linux Kernel,
	kernel-team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <guro@fb.com> wrote:
> The patchset has been tested on a number of different workloads in our
> production. In all cases, it saved hefty amounts of memory:
> 1) web frontend, 650-700 Mb, ~42% of slab memory
> 2) database cache, 750-800 Mb, ~35% of slab memory
> 3) dns server, 700 Mb, ~36% of slab memory

Do these workloads cycle through a lot of different memcgs?

For workloads that don't, wouldn't this approach potentially use more
memory? For example, a workload where everything is in one or two
memcgs, and those memcgs last forever.

-- Suleiman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-19 13:39 ` Suleiman Souhlal
@ 2019-09-19 16:22   ` Roman Gushchin
  2019-09-19 21:10     ` Suleiman Souhlal
  0 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-09-19 16:22 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Linux Kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <guro@fb.com> wrote:
> > The patchset has been tested on a number of different workloads in our
> > production. In all cases, it saved hefty amounts of memory:
> > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > 2) database cache, 750-800 Mb, ~35% of slab memory
> > 3) dns server, 700 Mb, ~36% of slab memory
> 
> Do these workloads cycle through a lot of different memcgs?

Not really, those are just plain services managed by systemd.
They aren't restarted too often, maybe several times per day at most.

Also, there is nothing fb-specific. You can take any modern
distribution (I've tried Fedora 30), boot it up and look at the
amount of slab memory. The numbers are roughly the same.

> 
> For workloads that don't, wouldn't this approach potentially use more
> memory? For example, a workload where everything is in one or two
> memcgs, and those memcgs last forever.
>

Yes, it's true, if you have a very small and fixed number of memory cgroups,
in theory the new approach can take ~10% more memory.

I don't think it's such a big problem though: it seems that the majority
of cgroup users have a lot of them, and they are dynamically created and
destroyed by systemd/kubernetes/whatever else.

And if somebody has a very special setup with only 1-2 cgroups, arguably
kernel memory accounting isn't such a big thing for them, so it can simply be
disabled. Am I wrong? Do you have a real-life example?
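
(For reference, kernel memory accounting can be switched off entirely at
boot time by adding

	cgroup.memory=nokmem

to the kernel command line.)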

Thanks!

Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-19 16:22   ` Roman Gushchin
@ 2019-09-19 21:10     ` Suleiman Souhlal
  2019-09-19 21:40       ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Suleiman Souhlal @ 2019-09-19 21:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Linux Kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Fri, Sep 20, 2019 at 1:22 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> > On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <guro@fb.com> wrote:
> > > The patchset has been tested on a number of different workloads in our
> > > production. In all cases, it saved hefty amounts of memory:
> > > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > > 2) database cache, 750-800 Mb, ~35% of slab memory
> > > 3) dns server, 700 Mb, ~36% of slab memory
> >
> > Do these workloads cycle through a lot of different memcgs?
>
> Not really, those are just plain services managed by systemd.
> They aren't restarted too often, maybe several times per day at most.
>
> Also, there is nothing fb-specific. You can take any new modern
> distributive (I've tried Fedora 30), boot it up and look at the
> amount of slab memory. Numbers are roughly the same.

Ah, ok.
These numbers are kind of surprising to me.
Do you know if the savings are similar if you use CONFIG_SLAB instead
of CONFIG_SLUB?

> > For workloads that don't, wouldn't this approach potentially use more
> > memory? For example, a workload where everything is in one or two
> > memcgs, and those memcgs last forever.
> >
>
> Yes, it's true, if you have a very small and fixed number of memory cgroups,
> in theory the new approach can take ~10% more memory.
>
> I don't think it's such a big problem though: it seems that the majority
> of cgroup users have a lot of them, and they are dynamically created and
> destroyed by systemd/kubernetes/whatever else.
>
> And if somebody has a very special setup with only 1-2 cgroups, arguably
> kernel memory accounting isn't such a big thing for them, so it can be simple
> disabled. Am I wrong and do you have a real-life example?

No, I don't have any specific examples.

-- Suleiman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-19 21:10     ` Suleiman Souhlal
@ 2019-09-19 21:40       ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-09-19 21:40 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Linux Kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Fri, Sep 20, 2019 at 06:10:11AM +0900, Suleiman Souhlal wrote:
> On Fri, Sep 20, 2019 at 1:22 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Sep 19, 2019 at 10:39:18PM +0900, Suleiman Souhlal wrote:
> > > On Fri, Sep 6, 2019 at 6:57 AM Roman Gushchin <guro@fb.com> wrote:
> > > > The patchset has been tested on a number of different workloads in our
> > > > production. In all cases, it saved hefty amounts of memory:
> > > > 1) web frontend, 650-700 Mb, ~42% of slab memory
> > > > 2) database cache, 750-800 Mb, ~35% of slab memory
> > > > 3) dns server, 700 Mb, ~36% of slab memory
> > >
> > > Do these workloads cycle through a lot of different memcgs?
> >
> > Not really, those are just plain services managed by systemd.
> > They aren't restarted too often, maybe several times per day at most.
> >
> > Also, there is nothing fb-specific. You can take any new modern
> > distributive (I've tried Fedora 30), boot it up and look at the
> > amount of slab memory. Numbers are roughly the same.
> 
> Ah, ok.
> These numbers are kind of surprising to me.
> Do you know if the savings are similar if you use CONFIG_SLAB instead
> of CONFIG_SLUB?

I did only brief testing of the SLAB version: the savings were there, the numbers
were slightly less impressive, but still in the double-digit percentage range.

> 
> > > For workloads that don't, wouldn't this approach potentially use more
> > > memory? For example, a workload where everything is in one or two
> > > memcgs, and those memcgs last forever.
> > >
> >
> > Yes, it's true, if you have a very small and fixed number of memory cgroups,
> > in theory the new approach can take ~10% more memory.
> >
> > I don't think it's such a big problem though: it seems that the majority
> > of cgroup users have a lot of them, and they are dynamically created and
> > destroyed by systemd/kubernetes/whatever else.
> >
> > And if somebody has a very special setup with only 1-2 cgroups, arguably
> > kernel memory accounting isn't such a big thing for them, so it can be simple
> > disabled. Am I wrong and do you have a real-life example?
> 
> No, I don't have any specific examples.
> 
> -- Suleiman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (11 preceding siblings ...)
  2019-09-19 13:39 ` Suleiman Souhlal
@ 2019-10-01 15:12 ` Michal Koutný
  2019-10-02  2:09   ` Roman Gushchin
  2019-12-09  9:17 ` [PATCH 00/16] " Bharata B Rao
  13 siblings, 1 reply; 44+ messages in thread
From: Michal Koutný @ 2019-10-01 15:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	kernel-team, Shakeel Butt, Vladimir Davydov, Waiman Long


On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <guro@fb.com> wrote:
> Roman Gushchin (14):
> [...]
>   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
From that commit's message:

> 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
>   no per-memcg kmem_caches anymore (empty output is printed)

An empty file currently means that no allocations took place in the particular
cgroup, so I find this quite a surprising change for consumers of these stats.

I understand obtaining the same data efficiently from the proposed
structures is difficult; however, such a change should be avoided. (In
my understanding, an obsoleted file means it's not available in v2; however, it
should not disappear from v1.)

Michal


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-10-01 15:12 ` Michal Koutný
@ 2019-10-02  2:09   ` Roman Gushchin
  2019-10-02 13:00     ` Suleiman Souhlal
  0 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-10-02  2:09 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Tue, Oct 01, 2019 at 05:12:02PM +0200, Michal Koutný wrote:
> On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Roman Gushchin (14):
> > [...]
> >   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> From that commit's message:
> 
> > 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
> >   no per-memcg kmem_caches anymore (empty output is printed)
> 
> The empty file means no allocations took place in the particular cgroup.
> I find this quite a surprising change for consumers of these stats.
> 
> I understand obtaining the same data efficiently from the proposed
> structures is difficult, however, such a change should be avoided. (In
> my understanding, obsoleted file ~ not available in v2, however, it
> should not disappear from v1.)

Well, my assumption is that nobody is using this file for anything except
debugging purposes (I might be wrong; if somebody has automation based
on it, please let me know). The number of allocations of each type per memory
cgroup is definitely useful debug information, but currently it barely works
(the displayed numbers mostly reflect the number of allocated pages, not the
number of active objects). We can support it, but it comes at a price, and
most users don't really need it. So I don't think it's worth making all
allocations slower just to keep a debug interface working for some
cgroup v1 users. Do you have examples where it's really useful and worth
the extra cpu cost?

Unfortunately, we can't enable it conditionally, as a user can switch
between cgroup v1 and cgroup v2 memory controllers dynamically.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-10-02  2:09   ` Roman Gushchin
@ 2019-10-02 13:00     ` Suleiman Souhlal
  2019-10-03 10:47       ` Michal Koutný
  0 siblings, 1 reply; 44+ messages in thread
From: Suleiman Souhlal @ 2019-10-02 13:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Waiman Long

On Wed, Oct 2, 2019 at 11:09 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Oct 01, 2019 at 05:12:02PM +0200, Michal Koutný wrote:
> > On Thu, Sep 05, 2019 at 02:45:44PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > Roman Gushchin (14):
> > > [...]
> > >   mm: memcg/slab: use one set of kmem_caches for all memory cgroups
> > From that commit's message:
> >
> > > 6) obsoletes kmem.slabinfo cgroup v1 interface file, as there are
> > >   no per-memcg kmem_caches anymore (empty output is printed)
> >
> > The empty file means no allocations took place in the particular cgroup.
> > I find this quite a surprising change for consumers of these stats.
> >
> > I understand obtaining the same data efficiently from the proposed
> > structures is difficult, however, such a change should be avoided. (In
> > my understanding, obsoleted file ~ not available in v2, however, it
> > should not disappear from v1.)
>
> Well, my assumption is that nobody is using this file for anything except
> debugging purposes (I might be wrong, if somebody has an automation based
> on it, please, let me know). A number of allocations of each type per memory
> cgroup is definitely a useful debug information, but currently it barely works
> (displayed numbers show mostly the number of allocated pages, not the number
> of active objects). We can support it, but it comes with the price, and
> most users don't really need it. So I don't think it worth it to make all
> allocations slower just to keep some debug interface working for some
> cgroup v1 users. Do you have examples when it's really useful and worth
> extra cpu cost?
>
> Unfortunately, we can't enable it conditionally, as a user can switch
> between cgroup v1 and cgroup v2 memory controllers dynamically.

kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
I am however not aware of any automation based on it.

Maybe it would be worth adding it to cgroup v2 and having a CONFIG
option to enable it?

-- Suleiman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-10-02 13:00     ` Suleiman Souhlal
@ 2019-10-03 10:47       ` Michal Koutný
  2019-10-03 15:52         ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Koutný @ 2019-10-03 10:47 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Roman Gushchin, Johannes Weiner, Kernel Team, Vladimir Davydov,
	Shakeel Butt, Michal Hocko, linux-mm, Waiman Long, linux-kernel


On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal <suleiman@google.com> wrote:
> kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> I am however not aware of any automation based on it.
My experience is the same. However, the point is that this has been
exposed for ages, so the safe assumption is that there may be users.

> Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> option to enable it?
I don't think a v2 file is necessary given the cost of obtaining the
information. But I concur with the idea of making the per-object tracking
switchable at boot time or at least CONFIGurable.
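
Something along these lines, purely as a hypothetical sketch (the static key,
the parameter name and the early_param hook below are made up for illustration;
they are not part of the posted series):

static DEFINE_STATIC_KEY_TRUE(memcg_slab_tracking);

static int __init memcg_slab_tracking_param(char *buf)
{
	/* e.g. booting with "memcg_slab_tracking=off" would skip the
	 * per-object memcg bookkeeping needed for kmem.slabinfo */
	if (buf && !strcmp(buf, "off"))
		static_branch_disable(&memcg_slab_tracking);
	return 0;
}
early_param("memcg_slab_tracking", memcg_slab_tracking_param);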

Michal


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH RFC 00/14] The new slab memory controller
  2019-10-03 10:47       ` Michal Koutný
@ 2019-10-03 15:52         ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-10-03 15:52 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Suleiman Souhlal, Johannes Weiner, Kernel Team, Vladimir Davydov,
	Shakeel Butt, Michal Hocko, linux-mm, Waiman Long, linux-kernel

On Thu, Oct 03, 2019 at 12:47:41PM +0200, Michal Koutný wrote:
> On Wed, Oct 02, 2019 at 10:00:07PM +0900, Suleiman Souhlal <suleiman@google.com> wrote:
> > kmem.slabinfo has been absolutely invaluable for debugging, in my experience.
> > I am however not aware of any automation based on it.
> My experience is the same. However, the point is that this has been
> exposed since ages, so the safe assumption is that there may be users.

Yes, but kernel memory accounting was an opt-in feature for years,
and it can also be disabled at boot time, so displaying an empty
memory.slabinfo file doesn't break the interface.

> 
> > Maybe it might be worth adding it to cgroup v2 and have a CONFIG
> > option to enable it?
> I don't think v2 file is necessary given the cost of obtaining the
> information. But I concur the idea of making the per-object tracking
> switchable at boot time or at least CONFIGurable.

As I said, the cost is the same and has to be paid in any case,
no matter whether cgroup v1 or v2 is used. A user can dynamically switch
between v1 and v2, and there is no way to obtain this information
after the fact, so we'd have to collect it from scratch.

Another concern I have is that it will require adding a non-trivial amount
of new code (as we no longer create and destroy kmem_caches dynamically).
It's perfectly doable, but I'm not sure we need it badly enough
to postpone merging the main thing. But I'm happy to hear
any arguments why that's not true.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
                   ` (12 preceding siblings ...)
  2019-10-01 15:12 ` Michal Koutný
@ 2019-12-09  9:17 ` Bharata B Rao
  2019-12-09 11:56   ` Bharata B Rao
  13 siblings, 1 reply; 44+ messages in thread
From: Bharata B Rao @ 2019-12-09  9:17 UTC (permalink / raw)
  To: guro
  Cc: mhocko, hannes, linux-kernel, kernel-team, shakeelb,
	vdavydov.dev, longman

Hi,

I see the below crash during early boot when I try this patchset on a
PowerPC host. I am on the new_slab.rfc.v5.3 branch.

BUG: Unable to handle kernel data access at 0x81030236d1814578
Faulting instruction address: 0xc0000000002cc314
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
NIP:  c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
REGS: c000001e40f378b0 TRAP: 0380   Not tainted  (5.3.0-g9bd85fd72a0c)
MSR:  900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022224  XER: 00000000
CFAR: c0000000002c6ad4 IRQMASK: 1 
GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000 
GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400 
GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030 
GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8 
GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000 
GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400 
GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000 
GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010 
NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
Call Trace:
[c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
[c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
[c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
[c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
[c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
Instruction dump:
4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020 
e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78 

Looks like page->mem_cgroup_vec is allocated but not yet initialized
with memcg pointers when we try to access them.

I did get past the crash by initializing the pointers like this
in account_kernel_stack(), but I am pretty sure that this is not the
place to do this:

diff --git a/kernel/fork.c b/kernel/fork.c
index 541fd805fb88..be21419feae2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -380,13 +380,26 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
                 * All stack pages are in the same zone and belong to the
                 * same memcg.
                 */
-               struct page *first_page = virt_to_page(stack);
+               struct page *first_page = virt_to_head_page((stack));
+               unsigned long off;
+               struct mem_cgroup_ptr *memcg_ptr;
+               struct mem_cgroup *memcg;
 
                mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
                                    THREAD_SIZE / 1024 * account);
 
-               mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
+               if (!first_page->mem_cgroup_vec)
+                       return;
+               off = obj_to_index(task_struct_cachep, first_page, stack);
+               memcg_ptr = first_page->mem_cgroup_vec[off];
+               if (!memcg_ptr)
+                       return;
+               rcu_read_lock();
+               memcg = memcg_ptr->memcg;
+               if (memcg)
+                       __mod_memcg_state(memcg, MEMCG_KERNEL_STACK_KB,
                                     account * (THREAD_SIZE / 1024));
+               rcu_read_unlock();
        }
 }
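
For context, the obj_to_index() call above is the new SLUB helper added by
this series; conceptually it maps an object pointer to its slot number within
the slab page, which is then used to index page->mem_cgroup_vec. A simplified
sketch of the idea (not the exact code from the patchset, which may differ,
e.g. by avoiding the division):

static inline unsigned int obj_to_index(const struct kmem_cache *s,
					const struct page *page, void *obj)
{
	/* slot number of @obj within its slab page */
	return (obj - page_address(page)) / s->size;
}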

Regards,
Bharata.


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-12-09  9:17 ` [PATCH 00/16] " Bharata B Rao
@ 2019-12-09 11:56   ` Bharata B Rao
  2019-12-09 18:04     ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Bharata B Rao @ 2019-12-09 11:56 UTC (permalink / raw)
  To: guro
  Cc: mhocko, hannes, linux-kernel, kernel-team, shakeelb,
	vdavydov.dev, longman

On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> Hi,
> 
> I see the below crash during early boot when I try this patchset on
> PowerPC host. I am on new_slab.rfc.v5.3 branch.
> 
> BUG: Unable to handle kernel data access at 0x81030236d1814578
> Faulting instruction address: 0xc0000000002cc314
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
> CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
> NIP:  c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
> REGS: c000001e40f378b0 TRAP: 0380   Not tainted  (5.3.0-g9bd85fd72a0c)
> MSR:  900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022224  XER: 00000000
> CFAR: c0000000002c6ad4 IRQMASK: 1 
> GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000 
> GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400 
> GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030 
> GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8 
> GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000 
> GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400 
> GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000 
> GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010 
> NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
> LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
> Call Trace:
> [c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
> [c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
> [c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
> [c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
> [c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
> Instruction dump:
> 4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020 
> e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78 
> 
> Looks like page->mem_cgroup_vec is allocated but not yet initialized
> with memcg pointers when we try to access them.
> 
> I did get past the crash by initializing the pointers like this
> in account_kernel_stack(),

The above is not an accurate description of the hack I showed below.
Essentially I am making sure that I get to the memcg corresponding
to the task_struct_cachep object in the page.

But that still doesn't explain why we don't hit this problem on x86.

> but I am pretty sure that this is not the
> place to do this:
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 541fd805fb88..be21419feae2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -380,13 +380,26 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
>                  * All stack pages are in the same zone and belong to the
>                  * same memcg.
>                  */
> -               struct page *first_page = virt_to_page(stack);
> +               struct page *first_page = virt_to_head_page((stack));
> +               unsigned long off;
> +               struct mem_cgroup_ptr *memcg_ptr;
> +               struct mem_cgroup *memcg;
>  
>                 mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
>                                     THREAD_SIZE / 1024 * account);
>  
> -               mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
> +               if (!first_page->mem_cgroup_vec)
> +                       return;
> +               off = obj_to_index(task_struct_cachep, first_page, stack);
> +               memcg_ptr = first_page->mem_cgroup_vec[off];
> +               if (!memcg_ptr)
> +                       return;
> +               rcu_read_lock();
> +               memcg = memcg_ptr->memcg;
> +               if (memcg)
> +                       __mod_memcg_state(memcg, MEMCG_KERNEL_STACK_KB,
>                                      account * (THREAD_SIZE / 1024));
> +               rcu_read_unlock();
>         }
>  }
> 
> Regards,
> Bharata.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-12-09 11:56   ` Bharata B Rao
@ 2019-12-09 18:04     ` Roman Gushchin
  2019-12-10  6:23       ` Bharata B Rao
  0 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-12-09 18:04 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: mhocko, hannes, linux-kernel, Kernel Team, shakeelb,
	vdavydov.dev, longman

On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > I see the below crash during early boot when I try this patchset on
> > PowerPC host. I am on new_slab.rfc.v5.3 branch.
> > 
> > BUG: Unable to handle kernel data access at 0x81030236d1814578
> > Faulting instruction address: 0xc0000000002cc314
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> > Modules linked in: ip_tables x_tables autofs4 sr_mod cdrom usbhid bnx2x crct10dif_vpmsum crct10dif_common mdio libcrc32c crc32c_vpmsum
> > CPU: 31 PID: 1752 Comm: keyboard-setup. Not tainted 5.3.0-g9bd85fd72a0c #155
> > NIP:  c0000000002cc314 LR: c0000000002cc2e8 CTR: 0000000000000000
> > REGS: c000001e40f378b0 TRAP: 0380   Not tainted  (5.3.0-g9bd85fd72a0c)
> > MSR:  900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022224  XER: 00000000
> > CFAR: c0000000002c6ad4 IRQMASK: 1 
> > GPR00: c0000000000b8a40 c000001e40f37b40 c000000000ed9600 0000000000000000 
> > GPR04: 0000000000000023 0000000000000010 c000001e40f37b24 c000001e3cba3400 
> > GPR08: 0000000000000020 81030218815f4578 0000001e50220000 0000000000000030 
> > GPR12: 0000000000002200 c000001fff774d80 0000000000000000 00000001072600d8 
> > GPR16: 0000000000000000 c0000000000bbaac 0000000000000000 0000000000000000 
> > GPR20: c000001e40f37c48 0000000000000001 0000000000000000 c000001e3cba3400 
> > GPR24: c000001e40f37dd8 0000000000000000 c000000000fa0d58 0000000000000000 
> > GPR28: c000001e3a080080 c000001e32da0100 0000000000000118 0000000000000010 
> > NIP [c0000000002cc314] __mod_memcg_state+0x58/0xd0
> > LR [c0000000002cc2e8] __mod_memcg_state+0x2c/0xd0
> > Call Trace:
> > [c000001e40f37b90] [c0000000000b8a40] account_kernel_stack+0xa4/0xe4
> > [c000001e40f37bd0] [c0000000000ba4a4] copy_process+0x2b4/0x16f0
> > [c000001e40f37cf0] [c0000000000bbaac] _do_fork+0x9c/0x3e4
> > [c000001e40f37db0] [c0000000000bc030] sys_clone+0x74/0xa8
> > [c000001e40f37e20] [c00000000000bb34] ppc_clone+0x8/0xc
> > Instruction dump:
> > 4bffa7e9 2fa30000 409e007c 395efffb 3d000020 2b8a0001 409d0008 39000020 
> > e93d0718 e94d0028 7bde1f24 7d29f214 <7ca9502a> 7fff2a14 7fe9fe76 7d27fa78 
> > 
> > Looks like page->mem_cgroup_vec is allocated but not yet initialized
> > with memcg pointers when we try to access them.
> > 
> > I did get past the crash by initializing the pointers like this
> > in account_kernel_stack(),
> 
> The above is not an accurate description of the hack I showed below.
> Essentially I am making sure that I get to the memcg corresponding
> to task_struct_cachep object in the page.

Hello, Bharata!

Thank you very much for the report and the patch, it's a good catch,
and the code looks good to me. I'll include the fix in the next
version of the patchset (I can't keep it as a separate fix due to massive
renamings/rewrites).

> 
> But that still doesn't explain why we don't hit this problem on x86.

On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
allocated directly by the page allocator, bypassing the slab allocator.
It depends on CONFIG_VMAP_STACK.

Btw, thank you for looking into the patchset and trying it on powerpc.
Would you mind sharing some results?

Thank you!

Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-12-09 18:04     ` Roman Gushchin
@ 2019-12-10  6:23       ` Bharata B Rao
  2019-12-10 18:05         ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Bharata B Rao @ 2019-12-10  6:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: mhocko, hannes, linux-kernel, Kernel Team, shakeelb,
	vdavydov.dev, longman

On Mon, Dec 09, 2019 at 06:04:22PM +0000, Roman Gushchin wrote:
> On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> > On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> Hello, Bharata!
> 
> Thank you very much for the report and the patch, it's a good catch,
> and the code looks good to me. I'll include the fix into the next
> version of the patchset (I can't keep it as a separate fix due to massive
> renamings/rewrites).

Sure, but please note that I posted only the core change, without
the associated header includes etc., where I took some shortcuts.

> 
> > 
> > But that still doesn't explain why we don't hit this problem on x86.
> 
> On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
> allocated directly by the page allocator, bypassing the slab allocator.
> It depends on CONFIG_VMAP_STACK.

I turned off CONFIG_VMAP_STACK on x86, but still don't hit any
problems.

$ grep VMAP .config
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set

Maybe something else prevents this particular crash on x86?

> 
> Btw, thank you for looking into the patchset and trying it on powerpc.
> Would you mind to share some results?

Sure, I will get back with more results, but initial numbers when running
a small alpine docker image look promising.

With slab patches
# docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
24bc99d94d91        sleek               0.00%               1MiB / 25MiB        4.00%               1.81kB / 0B         0B / 0B             0

Without slab patches
# docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
52382f8aaa13        sleek               0.00%               8.688MiB / 25MiB    34.75%              1.53kB / 0B         0B / 0B             0

So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
docker container isn't doing anything useful and hence the numbers
aren't representative of any workload.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-12-10  6:23       ` Bharata B Rao
@ 2019-12-10 18:05         ` Roman Gushchin
  2020-01-13  8:47           ` Bharata B Rao
  0 siblings, 1 reply; 44+ messages in thread
From: Roman Gushchin @ 2019-12-10 18:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: mhocko, hannes, linux-kernel, Kernel Team, shakeelb,
	vdavydov.dev, longman

On Tue, Dec 10, 2019 at 11:53:08AM +0530, Bharata B Rao wrote:
> On Mon, Dec 09, 2019 at 06:04:22PM +0000, Roman Gushchin wrote:
> > On Mon, Dec 09, 2019 at 05:26:49PM +0530, Bharata B Rao wrote:
> > > On Mon, Dec 09, 2019 at 02:47:52PM +0530, Bharata B Rao wrote:
> > Hello, Bharata!
> > 
> > Thank you very much for the report and the patch, it's a good catch,
> > and the code looks good to me. I'll include the fix into the next
> > version of the patchset (I can't keep it as a separate fix due to massive
> > renamings/rewrites).
> 
> Sure, but please note that I did post only the core change without
> the associated header includes etc, where I took some short cuts.

Sure, I'll adapt the code for the next version.

> 
> > 
> > > 
> > > But that still doesn't explain why we don't hit this problem on x86.
> > 
> > On x86 (and arm64) we're using vmap-based stacks, so the underlying memory is
> > allocated directly by the page allocator, bypassing the slab allocator.
> > It depends on CONFIG_VMAP_STACK.
> 
> I turned off CONFIG_VMAP_STACK on x86, but still don't hit any
> problems.

If you look at kernel/fork.c (around line 184), there are two ORed conditions
for bypassing the slab allocator:
1) THREAD_SIZE >= PAGE_SIZE
2) CONFIG_VMAP_STACK

I guess the first one is what saves x86 in your case, while on ppc you might
have 64k pages (hard to say without looking at your config).
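
Roughly, the structure there looks something like this (a simplified sketch,
not the exact kernel source):

#if defined(CONFIG_VMAP_STACK) || (THREAD_SIZE >= PAGE_SIZE)
/* stacks come from vmalloc or directly from the page allocator */
#else
/* THREAD_SIZE < PAGE_SIZE (e.g. 16k stacks with 64k pages on ppc):
 * stacks are slab objects from a dedicated cache */
static struct kmem_cache *thread_stack_cache;

static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
	unsigned long *stack;

	stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
	tsk->stack = stack;
	return stack;
}
#endif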

> 
> $ grep VMAP .config
> CONFIG_HAVE_ARCH_HUGE_VMAP=y
> CONFIG_HAVE_ARCH_VMAP_STACK=y
> # CONFIG_VMAP_STACK is not set
> 
> May be something else prevents this particular crash on x86?

I'm pretty sure it would crash had the stack been allocated using
the slab allocator. I bet everybody is using vmap-based stacks.

> 
> > 
> > Btw, thank you for looking into the patchset and trying it on powerpc.
> > Would you mind to share some results?
> 
> Sure, I will get back with more results, but initial numbers when running
> a small alpine docker image look promising.
> 
> With slab patches
> # docker stats --no-stream
> CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> 24bc99d94d91        sleek               0.00%               1MiB / 25MiB        4.00%               1.81kB / 0B         0B / 0B             0
> 
> Without slab patches
> # docker stats --no-stream
> CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> 52382f8aaa13        sleek               0.00%               8.688MiB / 25MiB    34.75%              1.53kB / 0B         0B / 0B             0
> 
> So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> docker container isn't doing anything useful and hence the numbers
> aren't representative of any workload.

Cool, that's great!

Small containers are where the relative win is the biggest. Of course, it will
decrease with container size, but that's expected.

If you get any additional numbers, please share them. It's really
interesting, especially if you have larger-than-4k pages.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-12-10 18:05         ` Roman Gushchin
@ 2020-01-13  8:47           ` Bharata B Rao
  2020-01-13 15:31             ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Bharata B Rao @ 2020-01-13  8:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: mhocko, hannes, linux-kernel, Kernel Team, shakeelb,
	vdavydov.dev, longman

On Tue, Dec 10, 2019 at 06:05:20PM +0000, Roman Gushchin wrote:
> > 
> > With slab patches
> > # docker stats --no-stream
> > CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> > 24bc99d94d91        sleek               0.00%               1MiB / 25MiB        4.00%               1.81kB / 0B         0B / 0B             0
> > 
> > Without slab patches
> > # docker stats --no-stream
> > CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> > 52382f8aaa13        sleek               0.00%               8.688MiB / 25MiB    34.75%              1.53kB / 0B         0B / 0B             0
> > 
> > So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> > docker container isn't doing anything useful and hence the numbers
> > aren't representative of any workload.
> 
> Cool, that's great!
> 
> Small containers is where the relative win is the biggest. Of course, it will
> decrease with the size of containers, but it's expected.
> 
> If you'll get any additional numbers, please, share them. It's really
> interesting, especially if you have larger-than-4k pages.

I ran a couple of workloads contained within a memory cgroup and measured
memory.kmem.usage_in_bytes and memory.usage_in_bytes with and without
this patchset on a PowerPC host. I see a significant reduction in
memory.kmem.usage_in_bytes and some reduction in memory.usage_in_bytes.
Before posting the numbers, I would like to get the following clarified:

In the original case, the memory cgroup is charged (including kmem charging)
when a new slab page is allocated. In your patch, the subpage charging is
done in the slab_pre_alloc_hook routine. However, in this case I couldn't find
where exactly the kmem counters are being charged/updated. Hence I wanted to
make sure that the reduction in memory.kmem.usage_in_bytes that I am
seeing is indeed real and not because kmem accounting was missed for
slab usage.

Also, I see that all non-root allocations are coming from a single set of
kmem_caches. I guess the <kmemcache_name>-memcg caches don't show up in
/proc/slabinfo yet, nor are their stats accumulated into /proc/slabinfo?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2020-01-13  8:47           ` Bharata B Rao
@ 2020-01-13 15:31             ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2020-01-13 15:31 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: mhocko, hannes, linux-kernel, Kernel Team, shakeelb,
	vdavydov.dev, longman

On Mon, Jan 13, 2020 at 02:17:10PM +0530, Bharata B Rao wrote:
> On Tue, Dec 10, 2019 at 06:05:20PM +0000, Roman Gushchin wrote:
> > > 
> > > With slab patches
> > > # docker stats --no-stream
> > > CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> > > 24bc99d94d91        sleek               0.00%               1MiB / 25MiB        4.00%               1.81kB / 0B         0B / 0B             0
> > > 
> > > Without slab patches
> > > # docker stats --no-stream
> > > CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
> > > 52382f8aaa13        sleek               0.00%               8.688MiB / 25MiB    34.75%              1.53kB / 0B         0B / 0B             0
> > > 
> > > So that's an improvement of MEM USAGE from 8.688MiB to 1MiB. Note that this
> > > docker container isn't doing anything useful and hence the numbers
> > > aren't representative of any workload.
> > 
> > Cool, that's great!
> > 
> > Small containers is where the relative win is the biggest. Of course, it will
> > decrease with the size of containers, but it's expected.
> > 
> > If you'll get any additional numbers, please, share them. It's really
> > interesting, especially if you have larger-than-4k pages.
> 
> I run a couple of workloads contained within a memory cgroup and measured
> memory.kmem.usage_in_bytes and memory.usage_in_bytes with and without
> this patchset on PowerPC host. I see significant reduction in
> memory.kmem.usage_in_bytes and some reduction in memory.usage_in_bytes.
> Before posting the numbers, would like to get the following clarified:
> 
> In the original case, the memory cgroup is charged (including kmem charging)
 > when a new slab page is allocated. In your patch, the subpage charging is
> done in slab_pre_alloc_hook routine. However in this case, I couldn't find
> where exactly kmem counters are being charged/updated. Hence wanted to
> make sure that the reduction in memory.kmem.usage_in_bytes that I am
> seeing is indeed real and not because kmem accounting was missed out for
> slab usage?
> 
> Also, I see all non-root allocations are coming from a single set of
> kmem_caches. Guess <kmemcache_name>-memcg caches don't yet show up in
> /proc/slabinfo and nor their stats is accumulated into /proc/slabinfo?

Hello Bharata!

First, I'd look at the global slab counters in /proc/meminfo (or vmstat).
These reflect the total system-wide number of pages used by all slab
memory, and they are accurate.

As for the cgroup-level counters, they are not perfect in the version which
I posted. In general, on cgroup v1 kernel memory is accounted twice: as a part
of total memory (memory.usage_in_bytes) and as a separate value
(memory.kmem.usage_in_bytes). The version of the slab controller which you are
testing doesn't support the second one. Also, it doesn't include the space used
by the accounting metadata (one pointer per object) in the accounting.
But overall, the difference in the memory.usage_in_bytes values, beyond some
margin (~10% of the difference), is meaningful.

The next version, which I'm working on right now (and hope to post in a week
or so), will address these issues.
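
(For reference, the two cgroup v1 counters discussed above live in, e.g.,

	/sys/fs/cgroup/memory/<cgroup>/memory.usage_in_bytes
	/sys/fs/cgroup/memory/<cgroup>/memory.kmem.usage_in_bytes

assuming the usual v1 mount point; the second one is the counter that the
version under test doesn't populate yet.)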

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:31 ` Michal Hocko
@ 2019-10-22 15:59   ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-10-22 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, linux-kernel, Kernel Team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue, Oct 22, 2019 at 03:31:48PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> > 
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> > 
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> > 
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
> 
> What is the performance impact?

As I wrote, so far we haven't found any regression on any real-world workload.
Of course, it's pretty easy to come up with a synthetic test which will show
some performance hit: e.g. allocating and freeing a large number of objects
from a single cache in a single cgroup. The reason is simple: stats and accounting
are more precise, so they require more work. But I don't think it's a real
problem.

On the other hand, I expect to see some positive effects from the significantly
reduced number of unmovable pages: memory fragmentation should become lower.
And all kernel objects will reside on a smaller number of pages, so we can
expect better cache utilization.

> Also what is the effect on the memory
> reclaim side and the isolation. I would expect that mixing objects from
> different cgroups would have a negative/unpredictable impact on the
> memcg slab shrinking.

Slab shrinking already works on a per-object basis, so there are no changes here.

Quite the opposite: now the freed space can be reused by other cgroups, whereas
previously shrinking was often a useless operation, as nobody could reuse the
space unless all objects were freed and the page could be returned to the page
allocator.

Thanks!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:28   ` Michal Hocko
@ 2019-10-22 15:48     ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-10-22 15:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, linux-kernel, Kernel Team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue, Oct 22, 2019 at 03:28:00PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> > On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > [...]
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > 
> > How much of the memory are we talking about here?
> 
> Just to be more specific. Your cover mentions several hundreds of MBs
> but there is no scale to the overal charged memory. How much of that is
> the actual kmem accounted memory.

As I wrote, on average it saves 30-45% of slab memory.
The smallest number I've seen was about 15%, the largest over 60%.

The amount of slab memory isn't a very stable metric in general: it heavily
depends on the workload pattern, memory pressure, uptime, etc.
In absolute numbers I've seen savings from ~60 MB for an empty VM to
more than 2 GB for some production workloads.

Btw, please note that after a recent change from Vlastimil,
6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting"),
slab counters include large allocations which are passed
directly to the page allocator. This makes the memory savings
smaller in percentage terms, but of course not in absolute numbers.
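
For a quick-and-dirty sanity check of the utilization numbers without drgn,
something like the sketch below can be used (it is not the script mentioned
in the cover letter, just a rough approximation: it parses /proc/slabinfo and
compares active object bytes with the pages backing all slabs):

/* Rough userspace estimate of overall slab utilization; run as root. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);
        unsigned long long used = 0, total = 0;
        char line[512], name[64];
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                unsigned long active, num, objsize, objperslab, pagesperslab;
                unsigned long limit, batch, shared, aslabs, nslabs, avail;

                if (sscanf(line, "%63s %lu %lu %lu %lu %lu : tunables %lu %lu %lu"
                           " : slabdata %lu %lu %lu",
                           name, &active, &num, &objsize, &objperslab,
                           &pagesperslab, &limit, &batch, &shared,
                           &aslabs, &nslabs, &avail) != 12)
                        continue;       /* skip header lines */
                used  += (unsigned long long)active * objsize;
                total += (unsigned long long)nslabs * pagesperslab * page;
        }
        fclose(f);
        if (total)
                printf("slab utilization: ~%.1f%%\n", 100.0 * used / total);
        return 0;
}

It requires root and, with SLUB, reflects merged caches, so treat the
resulting number as a ballpark only.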

> 
> > Also is there any pattern for specific caches that tend to utilize
> > much worse than others?

Caches which usually have many objects (e.g. inodes) initially
have a better utilization, but as some of the objects get reclaimed,
the utilization drops. And if the cgroup is already dead, no one can
reuse these mostly empty slab pages, so it's pretty wasteful.

So I don't think the problem is specific to any particular cache; it's pretty
general.


* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 Roman Gushchin
  2019-10-18 17:03 ` Waiman Long
  2019-10-22 13:22 ` Michal Hocko
@ 2019-10-22 13:31 ` Michal Hocko
  2019-10-22 15:59   ` Roman Gushchin
  2 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2019-10-22 13:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.

What is the performance impact? Also what is the effect on the memory
reclaim side and the isolation. I would expect that mixing objects from
different cgroups would have a negative/unpredictable impact on the
memcg slab shrinking.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/16] The new slab memory controller
  2019-10-22 13:22 ` Michal Hocko
@ 2019-10-22 13:28   ` Michal Hocko
  2019-10-22 15:48     ` Roman Gushchin
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2019-10-22 13:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> [...]
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> 
> How much of the memory are we talking about here?

Just to be more specific. Your cover mentions several hundreds of MBs
but there is no scale to the overal charged memory. How much of that is
the actual kmem accounted memory.

> Also is there any pattern for specific caches that tend to utilize
> much worse than others?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 Roman Gushchin
  2019-10-18 17:03 ` Waiman Long
@ 2019-10-22 13:22 ` Michal Hocko
  2019-10-22 13:28   ` Michal Hocko
  2019-10-22 13:31 ` Michal Hocko
  2 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2019-10-22 13:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter

On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
[...]
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.

How much of the memory are we talking about here? Also is there any
pattern for specific caches that tend to utilize much worse than others?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18 17:03 ` Waiman Long
@ 2019-10-18 17:12   ` Roman Gushchin
  0 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-10-18 17:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-mm, Michal Hocko, Johannes Weiner, linux-kernel,
	Kernel Team, Shakeel Butt, Vladimir Davydov, Christoph Lameter

On Fri, Oct 18, 2019 at 01:03:54PM -0400, Waiman Long wrote:
> On 10/17/19 8:28 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
>                                                ^it^
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> 
> In the case of slub memory allocator, it is not just unused space within
> a slab. It is also the use of per-cpu slabs that can hold up a lot of
> memory, especially if the tasks jump around to different cpus. The
> problem is compounded if a lot of memcgs are being used. Memory
> utilization can improve quite significantly if per-cpu slabs are
> disabled. Of course, it comes with a performance cost.

Right, but it's basically the same problem: if slabs can be used only
by a single memory cgroup, slab utilization is low. Per-cpu slabs just
make the problem worse by increasing the number of mostly empty slabs
proportionally to the number of CPUs.

With memory cgroup accounting disabled, slab utilization is quite high
even with per-cpu slabs. So the problem isn't per-cpu slabs by themselves;
they just were never designed to exist in so many copies.
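
Just to illustrate the scale with made-up but plausible numbers: with 64 CPUs,
100 memory cgroups and ~200 accounted caches, the per-cpu slabs alone can pin
up to 64 * 100 * 200 = 1,280,000 slab pages in the worst case, i.e. roughly
5 GB with 4 KB pages, even if each of them holds only a couple of live objects.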

Thanks!


* Re: [PATCH 00/16] The new slab memory controller
  2019-10-18  0:28 Roman Gushchin
@ 2019-10-18 17:03 ` Waiman Long
  2019-10-18 17:12   ` Roman Gushchin
  2019-10-22 13:22 ` Michal Hocko
  2019-10-22 13:31 ` Michal Hocko
  2 siblings, 1 reply; 44+ messages in thread
From: Waiman Long @ 2019-10-18 17:03 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Christoph Lameter

On 10/17/19 8:28 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
                                               ^it^
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.

In the case of slub memory allocator, it is not just unused space within
a slab. It is also the use of per-cpu slabs that can hold up a lot of
memory, especially if the tasks jump around to different cpus. The
problem is compounded if a lot of memcgs are being used. Memory
utilization can improve quite significantly if per-cpu slabs are
disabled. Of course, it comes with a performance cost.

Cheers,
Longman



* [PATCH 00/16] The new slab memory controller
@ 2019-10-18  0:28 Roman Gushchin
  2019-10-18 17:03 ` Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Roman Gushchin @ 2019-10-18  0:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Christoph Lameter,
	Roman Gushchin

The existing slab memory controller is based on the idea of replicating
slab allocator internals for each memory cgroup. This approach promises
a low memory overhead (one pointer per page), and isn't adding too much
code on hot allocation and release paths. But is has a very serious flaw:
it leads to a low slab utilization.

Using a drgn* script I've got an estimation of slab utilization on
a number of machines running different production workloads. In most
cases it was between 45% and 65%, and the best number I've seen was
around 85%. Turning kmem accounting off brings it to high 90s. Also
it brings back 30-50% of slab memory. It means that the real price
of the existing slab memory controller is way bigger than a pointer
per page.

The real reason why the existing design leads to a low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If there are only few allocations of certain size made by a cgroup,
or if some active objects (e.g. dentries) are left after the cgroup is
deleted, or the cgroup contains a single-threaded application which is
barely allocating any kernel objects, but does it every time on a new CPU:
in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel is able to use free space
on slab pages for other allocations.

Arguably it wasn't an issue back to days when the kmem controller was
introduced and was an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2. And modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory controller,
which aims to reach a much better slab utilization by sharing slab pages
between multiple memory cgroups. Below is the short description of the new
design (more details in commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is performed on page-basis,
with rounding up and remembering leftovers.

Memcg ownership data is stored in a per-slab-page vector: for each slab page
a vector of corresponding size is allocated. To keep slab memory reparenting
working, instead of saving a pointer to the memory cgroup directly an
intermediate object is used. It's simply a pointer to a memcg (which can be
easily changed to the parent) with a built-in reference counter. This scheme
allows to reparent all allocated objects without walking them over and changing
memcg pointer to the parent.
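
For illustration, the intermediate object can be thought of as something like
the following kernel-style sketch (the names and details are made up here and
are not the actual definitions from the patches):

struct mem_cgroup;

/* Illustrative sketch: a refcounted, redirectable pointer to a memcg. */
struct mem_cgroup_ptr_sketch {
        struct mem_cgroup *memcg;       /* can be switched to the parent */
        int count;                      /* refcount; the real thing would be
                                           an atomic or percpu counter */
};

/*
 * On cgroup removal every slab object keeps its reference to the
 * intermediate object; only this single pointer has to be redirected
 * (real code would of course need proper synchronization here).
 */
static void reparent_sketch(struct mem_cgroup_ptr_sketch *ptr,
                            struct mem_cgroup *parent)
{
        ptr->memcg = parent;
}

The per-slab-page vector then stores pointers to these intermediate objects
rather than to the memcg itself.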

Instead of creating an individual set of kmem_caches for each memory cgroup,
two global sets are used: the root set for non-accounted and root-cgroup
allocations and the second set for all other allocations. This allows to
simplify the lifetime management of individual kmem_caches: they are destroyed
with root counterparts. It allows to remove a good amount of code and make
things generally simpler.
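
Schematically (again, the names below are made up), picking a cache on the
allocation side boils down to choosing between the two global sets instead of
looking up a per-cgroup copy:

struct kmem_cache;

/* Illustrative sketch: one shared companion cache per root cache,
 * instead of one copy per memory cgroup. */
struct kmem_cache_sets_sketch {
        struct kmem_cache *root_cache;  /* root cgroup + non-accounted */
        struct kmem_cache *memcg_cache; /* shared by all other cgroups */
};

static struct kmem_cache *
pick_cache_sketch(struct kmem_cache_sets_sketch *s, int accounted_nonroot)
{
        return accounted_nonroot ? s->memcg_cache : s->root_cache;
}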

The patchset contains a couple of semi-independent parts, which can find their
usage outside of the slab memory controller too:
1) subpage charging API, which can be used in the future for accounting of
   other non-page-sized objects, e.g. percpu allocations.
2) mem_cgroup_ptr API (refcounted pointers to a memcg), which can be reused
   for the efficient reparenting of other objects, e.g. pagecache.

The patchset has been tested on a number of different workloads in our
production. In all cases, it saved hefty amounts of memory:
1) web frontend, 650-700 MB, ~42% of slab memory
2) database cache, 750-800 MB, ~35% of slab memory
3) dns server, 700 MB, ~36% of slab memory

(These numbers were obtained using a backport of this patchset to the kernel
version used in fb production. But similar numbers can be obtained on
a vanilla kernel. If used on a modern systemd-based distribution,
e.g. Fedora 30, the patched kernel shows the same order of slab memory
savings just after system start.)

So far I haven't found any regression on any of the tested workloads, but
a potential CPU regression caused by more precise accounting is a concern.

Obviously the amount of saved memory depends on the number of memory cgroups,
uptime and specific workloads, but overall it feels like the new controller
saves 30-40% of slab memory, sometimes more. Additionally, it should lead
to lower memory fragmentation, simply because of the smaller number of
non-movable pages, and also because there is no longer a need to move all
slab objects to a new set of pages when a workload is restarted in a new
memory cgroup.

* https://github.com/osandov/drgn


v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Roman Gushchin (16):
  mm: memcg: introduce mem_cgroup_ptr
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: allocate space for memcg ownership data for non-root
    slabs
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: save memcg ownership data for non-root slab objects
  mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg: introduce __mod_lruvec_memcg_state()
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
  mm: memcg/slab: replace memcg_from_slab_page() with
    memcg_from_slab_obj()
  tools/cgroup: add slabinfo.py tool
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: use one set of kmem_caches for all memory cgroups
  tools/cgroup: make slabinfo.py compatible with new slab controller
  mm: slab: remove redundant check in memcg_accumulate_slabinfo()

 drivers/base/node.c        |  14 +-
 fs/proc/meminfo.c          |   4 +-
 include/linux/memcontrol.h |  98 +++++++-
 include/linux/mm_types.h   |   5 +-
 include/linux/mmzone.h     |  12 +-
 include/linux/slab.h       |   3 +-
 include/linux/slub_def.h   |   9 +
 include/linux/vmstat.h     |   8 +
 kernel/power/snapshot.c    |   2 +-
 mm/list_lru.c              |  12 +-
 mm/memcontrol.c            | 302 ++++++++++++-------------
 mm/oom_kill.c              |   2 +-
 mm/page_alloc.c            |   8 +-
 mm/slab.c                  |  37 ++-
 mm/slab.h                  | 300 ++++++++++++------------
 mm/slab_common.c           | 452 ++++---------------------------------
 mm/slob.c                  |  12 +-
 mm/slub.c                  |  63 ++----
 mm/vmscan.c                |   3 +-
 mm/vmstat.c                |  37 ++-
 mm/workingset.c            |   6 +-
 tools/cgroup/slabinfo.py   | 250 ++++++++++++++++++++
 22 files changed, 816 insertions(+), 823 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py

-- 
2.21.0



Thread overview: 44+ messages
2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
2019-09-16 12:56   ` Johannes Weiner
2019-09-17  2:27     ` Roman Gushchin
2019-09-17  8:50       ` Johannes Weiner
2019-09-17 18:33         ` Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
2019-09-05 22:34   ` Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 03/14] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
2019-09-16 12:38   ` Johannes Weiner
2019-09-17  2:08     ` Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 06/14] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 07/14] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 09/14] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
2019-09-05 22:37 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
2019-09-17 19:48 ` [PATCH RFC 00/14] The new slab memory controller Waiman Long
2019-09-17 21:24   ` Roman Gushchin
2019-09-19 13:39 ` Suleiman Souhlal
2019-09-19 16:22   ` Roman Gushchin
2019-09-19 21:10     ` Suleiman Souhlal
2019-09-19 21:40       ` Roman Gushchin
2019-10-01 15:12 ` Michal Koutný
2019-10-02  2:09   ` Roman Gushchin
2019-10-02 13:00     ` Suleiman Souhlal
2019-10-03 10:47       ` Michal Koutný
2019-10-03 15:52         ` Roman Gushchin
2019-12-09  9:17 ` [PATCH 00/16] " Bharata B Rao
2019-12-09 11:56   ` Bharata B Rao
2019-12-09 18:04     ` Roman Gushchin
2019-12-10  6:23       ` Bharata B Rao
2019-12-10 18:05         ` Roman Gushchin
2020-01-13  8:47           ` Bharata B Rao
2020-01-13 15:31             ` Roman Gushchin
2019-10-18  0:28 Roman Gushchin
2019-10-18 17:03 ` Waiman Long
2019-10-18 17:12   ` Roman Gushchin
2019-10-22 13:22 ` Michal Hocko
2019-10-22 13:28   ` Michal Hocko
2019-10-22 15:48     ` Roman Gushchin
2019-10-22 13:31 ` Michal Hocko
2019-10-22 15:59   ` Roman Gushchin
