* [PATCH v2 00/28] The new cgroup slab memory controller
@ 2020-01-27 17:34 Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments Roman Gushchin
                   ` (28 more replies)
  0 siblings, 29 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

The existing cgroup slab memory controller is based on the idea of
replicating slab allocator internals for each memory cgroup.
This approach promises a low memory overhead (one pointer per page)
and doesn't add much code to the hot allocation and release paths.
But it has a very serious flaw: it leads to low slab utilization.

Using a drgn* script, I estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings utilization into the high 90s and
gives back 30-50% of slab memory. It means that the real price of the
existing slab memory controller is much bigger than a pointer per page.

The real reason why the existing design leads to low slab utilization
is simple: slab pages are used exclusively by one memory cgroup.
If a cgroup makes only a few allocations of a certain size, if some
active objects (e.g. dentries) are left behind after the cgroup is
deleted, or if the cgroup contains a single-threaded application which
barely allocates any kernel objects but does so every time on a new CPU:
in all these cases the resulting slab utilization is very low.
If kmem accounting is off, the kernel is able to use the free space
on slab pages for other allocations.

Arguably it wasn't an issue back in the days when the kmem controller
was introduced as an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's enabled by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory controller,
which aims to reach a much better slab utilization by sharing slab pages
between multiple memory cgroups. Below is a short description of the new
design (more details in the commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is still performed on
a per-page basis, rounding up and remembering the leftovers, as sketched
below.
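
To illustrate, here is a minimal sketch of such a charging function. It is
not the patch code: the obj_cgroup layout, the obj_cgroup_charge_bytes()
name, locking and error handling are simplified assumptions; only
memcg_kmem_charge() refers to the renamed charging API from patches (1)-(6).

/*
 * Illustrative sketch only: charge 'size' bytes on behalf of an
 * intermediate ownership object (see the next paragraph). Page
 * counters are charged in whole pages; the unused remainder is
 * remembered and consumed by subsequent allocations.
 */
struct obj_cgroup {
	struct mem_cgroup *memcg;
	unsigned int nr_charged_bytes;	/* leftover from the last charge */
};

static int obj_cgroup_charge_bytes(struct obj_cgroup *objcg,
				   gfp_t gfp, size_t size)
{
	unsigned int nr_pages, nr_bytes;
	int ret;

	if (size <= objcg->nr_charged_bytes) {
		/* the leftover from a previous charge covers it */
		objcg->nr_charged_bytes -= size;
		return 0;
	}

	size -= objcg->nr_charged_bytes;
	objcg->nr_charged_bytes = 0;

	/* round up to whole pages and charge them to the memcg */
	nr_pages = DIV_ROUND_UP(size, PAGE_SIZE);
	nr_bytes = nr_pages * PAGE_SIZE - size;

	ret = memcg_kmem_charge(objcg->memcg, gfp, nr_pages);
	if (!ret)
		objcg->nr_charged_bytes = nr_bytes; /* remember the leftover */

	return ret;
}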

Memcg ownership data is stored in a per-slab-page vector: for each slab page
a vector of the corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of saving a
pointer to the memory cgroup directly. It's simply a pointer to a memcg
(which can easily be switched to the parent) with a built-in reference
counter. This scheme allows reparenting all allocated objects without
walking them over to change the memcg pointer in each of them.
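
A rough sketch of the reparenting mechanism, extending the structure from
the previous sketch (again illustrative, not the patch code: the refcnt,
list and objcg_list fields are assumptions, and locking/RCU details are
omitted):

/*
 * Illustrative sketch only: the intermediate object sits between
 * slab objects and their memory cgroup. On cgroup removal it is
 * enough to redirect the memcg pointer inside every live objcg;
 * the slab objects themselves are never touched.
 */
struct obj_cgroup {
	struct mem_cgroup __rcu *memcg;	/* switched to the parent on reparenting */
	unsigned int nr_charged_bytes;	/* see the previous sketch */
	struct percpu_ref refcnt;	/* held by every live object */
	struct list_head list;		/* anchored in the owning memcg */
};

static void reparent_obj_cgroups(struct mem_cgroup *memcg,
				 struct mem_cgroup *parent)
{
	struct obj_cgroup *objcg;

	list_for_each_entry(objcg, &memcg->objcg_list, list)
		rcu_assign_pointer(objcg->memcg, parent);
}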

Instead of creating an individual set of kmem_caches for each memory cgroup,
two global sets are used: the root set for non-accounted and root-cgroup
allocations, and a second set for all other allocations. This simplifies
the lifetime management of individual kmem_caches: they are destroyed
together with their root counterparts. It allows removing a good amount
of code and makes things generally simpler.
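
For illustration, cache selection on the allocation path then boils down to
choosing between a root cache and its single shared counterpart (the
memcg_cache field and the helper name below are illustrative, not the
patch code):

/*
 * Illustrative sketch only: with two global sets of kmem_caches,
 * the selected cache no longer depends on which cgroup is allocating.
 */
static inline struct kmem_cache *cache_to_alloc_from(struct kmem_cache *s,
						     gfp_t flags)
{
	if (memcg_kmem_enabled() &&
	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
		return s->memcg_cache;	/* shared accounted counterpart */

	return s;			/* root cache */
}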

The patchset* has been tested on a number of different workloads in our
production. In all cases it saved a significant amount of memory, ranging
from high hundreds of MBs to single GBs per host. On average, the size
of slab memory has been reduced by 35-45%.

(* These numbers were obtained using a backport of this patchset to the
kernel version used in fb production, but similar numbers can be obtained
on a vanilla kernel. On my personal desktop with an 8-core CPU and 16 GB
of RAM running Fedora 31, the new slab controller saves ~45-50% of slab
memory, measured just after booting the system).

Additionally, it should lead to lower memory fragmentation, simply because
of a smaller number of non-movable pages, and also because there is no
longer a need to move all slab objects to a new set of pages when a
workload is restarted in a new memory cgroup.

The patchset consists of several blocks:
patches (1)-(6) clean up the existing kmem accounting API,
patches (7)-(13) prepare vmstat to count individual slab objects,
patches (14)-(21) implement the main idea of the patchset,
patches (22)-(25) are follow-up clean-ups of the memcg/slab code,
patches (26)-(27) implement a drgn-based replacement for per-memcg slabinfo,
patch (28) adds kselftests covering kernel memory accounting functionality.


* https://github.com/osandov/drgn

v2:
  1) implemented re-layering and renaming suggested by Johannes,
    added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto mm tree, etc

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (27):
  mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
  mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
  mm: kmem: rename memcg_kmem_(un)charge() into
    memcg_kmem_(un)charge_page()
  mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
  mm: memcg/slab: cache page number in memcg_(un)charge_slab()
  mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to
    __memcg_kmem_(un)charge()
  mm: memcg/slab: introduce mem_cgroup_from_obj()
  mm: fork: fix kernel_stack memcg stats for various stack
    implementations
  mm: memcg/slab: rename __mod_lruvec_slab_state() into
    __mod_lruvec_obj_state()
  mm: memcg: introduce mod_lruvec_memcg_state()
  mm: slub: implement SLUB version of obj_to_index()
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: deprecate memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  tools/cgroup: add slabinfo.py tool
  tools/cgroup: make slabinfo.py compatible with new slab controller
  kselftests: cgroup: add kernel memory accounting tests

 drivers/base/node.c                        |  14 +-
 fs/pipe.c                                  |   2 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 | 147 ++++-
 include/linux/mm.h                         |  25 +-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  12 +-
 include/linux/slab.h                       |   5 +-
 include/linux/slub_def.h                   |   9 +
 include/linux/vmstat.h                     |   8 +
 kernel/fork.c                              |  13 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/list_lru.c                              |  12 +-
 mm/memcontrol.c                            | 638 +++++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |  12 +-
 mm/slab.c                                  |  36 +-
 mm/slab.h                                  | 346 +++++------
 mm/slab_common.c                           | 513 ++---------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  |  62 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  37 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/slabinfo.py                   | 220 +++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 380 ++++++++++++
 28 files changed, 1505 insertions(+), 1023 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

-- 
2.24.1




* [PATCH v2 01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 02/28] mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments Roman Gushchin
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

The first argument of memcg_kmem_charge_memcg() and
__memcg_kmem_charge_memcg() is the page pointer, and it's not used.
Let's drop it.

The memcg pointer is passed as the last argument. Move it to
the first position for consistency with other memcg functions,
e.g. __memcg_kmem_uncharge_memcg() or try_charge().

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 9 ++++-----
 mm/memcontrol.c            | 8 +++-----
 mm/slab.h                  | 2 +-
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..c954209fd685 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1364,8 +1364,7 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep);
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			      struct mem_cgroup *memcg);
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
 
@@ -1402,11 +1401,11 @@ static inline void memcg_kmem_uncharge(struct page *page, int order)
 		__memcg_kmem_uncharge(page, order);
 }
 
-static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
-					  int order, struct mem_cgroup *memcg)
+static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+					  int order)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		return __memcg_kmem_charge_memcg(memcg, gfp, order);
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f6dc8712e39..36a01d940e4b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2820,15 +2820,13 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep)
 
 /**
  * __memcg_kmem_charge_memcg: charge a kmem page
- * @page: page to charge
+ * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
  * @order: allocation order
- * @memcg: memory cgroup to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			    struct mem_cgroup *memcg)
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order)
 {
 	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
@@ -2874,7 +2872,7 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		ret = __memcg_kmem_charge_memcg(memcg, gfp, order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
diff --git a/mm/slab.h b/mm/slab.h
index 7e94700aa78c..c4c93e991250 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -365,7 +365,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, order);
 	if (ret)
 		goto out;
 
-- 
2.24.1




* [PATCH v2 02/28] mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 03/28] mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() Roman Gushchin
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Drop the unused page argument and put the memcg pointer in the first
position. This makes the function consistent with its peers:
__memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 4 ++--
 mm/slab.h                  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c954209fd685..900a9f884260 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1409,8 +1409,8 @@ static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
-					     struct mem_cgroup *memcg)
+static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+					     int order)
 {
 	if (memcg_kmem_enabled())
 		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
diff --git a/mm/slab.h b/mm/slab.h
index c4c93e991250..e7da63fb8211 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -395,7 +395,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
-		memcg_kmem_uncharge_memcg(page, order, memcg);
+		memcg_kmem_uncharge_memcg(memcg, order);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    -(1 << order));
-- 
2.24.1




* [PATCH v2 03/28] mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 02/28] mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 04/28] mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg() Roman Gushchin
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:
1) call __memcg_kmem_(un)charge_memcg() to actually charge or
uncharge the current memcg
2) set or clear the PageKmemcg flag

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/pipe.c                  |  2 +-
 include/linux/memcontrol.h | 23 +++++++++++++----------
 kernel/fork.c              |  9 +++++----
 mm/memcontrol.c            |  8 ++++----
 mm/page_alloc.c            |  4 ++--
 5 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 423aafca4338..ae77f47b4fc8 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -144,7 +144,7 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 
 	if (page_count(page) == 1) {
-		memcg_kmem_uncharge(page, 0);
+		memcg_kmem_uncharge_page(page, 0);
 		__SetPageLocked(page);
 		return 0;
 	}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 900a9f884260..4ee0c345e905 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1362,8 +1362,8 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
-void __memcg_kmem_uncharge(struct page *page, int order);
+int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
+void __memcg_kmem_uncharge_page(struct page *page, int order);
 int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
@@ -1388,17 +1388,18 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					 int order)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge(page, gfp, order);
+		return __memcg_kmem_charge_page(page, gfp, order);
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge(struct page *page, int order)
+static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge(page, order);
+		__memcg_kmem_uncharge_page(page, order);
 }
 
 static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
@@ -1428,21 +1429,23 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 
 #else
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					 int order)
 {
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge(struct page *page, int order)
+static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
-static inline int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int __memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					   int order)
 {
 	return 0;
 }
 
-static inline void __memcg_kmem_uncharge(struct page *page, int order)
+static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2d14e20a97e0..4dad271ee28e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -281,7 +281,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
 					     MEMCG_KERNEL_STACK_KB,
 					     -(int)(PAGE_SIZE / 1024));
 
-			memcg_kmem_uncharge(vm->pages[i], 0);
+			memcg_kmem_uncharge_page(vm->pages[i], 0);
 		}
 
 		for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -413,12 +413,13 @@ static int memcg_charge_kernel_stack(struct task_struct *tsk)
 
 		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
 			/*
-			 * If memcg_kmem_charge() fails, page->mem_cgroup
-			 * pointer is NULL, and both memcg_kmem_uncharge()
+			 * If memcg_kmem_charge_page() fails, page->mem_cgroup
+			 * pointer is NULL, and both memcg_kmem_uncharge_page()
 			 * and mod_memcg_page_state() in free_thread_stack()
 			 * will ignore this page. So it's safe.
 			 */
-			ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0);
+			ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL,
+						     0);
 			if (ret)
 				return ret;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36a01d940e4b..4ed98b930323 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2855,14 +2855,14 @@ int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order)
 }
 
 /**
- * __memcg_kmem_charge: charge a kmem page to the current memory cgroup
+ * __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup
  * @page: page to charge
  * @gfp: reclaim mode
  * @order: allocation order
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 {
 	struct mem_cgroup *memcg;
 	int ret = 0;
@@ -2898,11 +2898,11 @@ void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 /**
- * __memcg_kmem_uncharge: uncharge a kmem page
+ * __memcg_kmem_uncharge_page: uncharge a kmem page
  * @page: page to uncharge
  * @order: allocation order
  */
-void __memcg_kmem_uncharge(struct page *page, int order)
+void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 	struct mem_cgroup *memcg = page->mem_cgroup;
 	unsigned int nr_pages = 1 << order;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..f842ebcb4600 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1152,7 +1152,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
 	if (memcg_kmem_enabled() && PageKmemcg(page))
-		__memcg_kmem_uncharge(page, order);
+		__memcg_kmem_uncharge_page(page, order);
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -4752,7 +4752,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
-	    unlikely(__memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+	    unlikely(__memcg_kmem_charge_page(page, gfp_mask, order) != 0)) {
 		__free_pages(page, order);
 		page = NULL;
 	}
-- 
2.24.1




* [PATCH v2 04/28] mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (2 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 03/28] mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 05/28] mm: memcg/slab: cache page number in memcg_(un)charge_slab() Roman Gushchin
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

These functions charge the given number of kernel pages to the
given memory cgroup. The number doesn't have to be a power of two.
Let's make them take an unsigned int nr_pages argument
instead of the page order.

It makes them consistent with the corresponding uncharge functions
and with functions like mem_cgroup_charge_skmem(memcg, nr_pages).

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 11 ++++++-----
 mm/memcontrol.c            |  8 ++++----
 mm/slab.h                  |  2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4ee0c345e905..851c373edb74 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1364,7 +1364,8 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep);
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+			      unsigned int nr_pages);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
 
@@ -1403,18 +1404,18 @@ static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 }
 
 static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-					  int order)
+					  unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(memcg, gfp, order);
+		return __memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
 	return 0;
 }
 
 static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-					     int order)
+					     unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
+		__memcg_kmem_uncharge_memcg(memcg, nr_pages);
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4ed98b930323..1561ef984104 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2822,13 +2822,13 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep)
  * __memcg_kmem_charge_memcg: charge a kmem page
  * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
- * @order: allocation order
+ * @nr_pages: number of pages to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order)
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+			      unsigned int nr_pages)
 {
-	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
 	int ret;
 
@@ -2872,7 +2872,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(memcg, gfp, order);
+		ret = __memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
diff --git a/mm/slab.h b/mm/slab.h
index e7da63fb8211..d96c87a30a9b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -365,7 +365,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, order);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
 	if (ret)
 		goto out;
 
-- 
2.24.1




* [PATCH v2 05/28] mm: memcg/slab: cache page number in memcg_(un)charge_slab()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (3 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 04/28] mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 06/28] mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge() Roman Gushchin
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

There are many places in memcg_charge_slab() and memcg_uncharge_slab()
which calculate the number of pages to charge, the css references to
grab, etc., depending on the order of the slab page.

Let's simplify the code by calculating it once and caching it in a
local variable.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/slab.h | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index d96c87a30a9b..a7ed8b422d8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -348,6 +348,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
+	unsigned int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	int ret;
@@ -360,17 +361,17 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    (1 << order));
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+				    nr_pages);
+		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
 	if (ret)
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
@@ -387,6 +388,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
+	unsigned int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
@@ -394,15 +396,15 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
-		memcg_kmem_uncharge_memcg(memcg, order);
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
+		memcg_kmem_uncharge_memcg(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -nr_pages);
 	}
 	rcu_read_unlock();
 
-	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
+	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-- 
2.24.1




* [PATCH v2 06/28] mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (4 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 05/28] mm: memcg/slab: cache page number in memcg_(un)charge_slab() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj() Roman Gushchin
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions.
It's shorter and more obvious.

These are the most basic functions, which just (un)charge the
given cgroup with the given number of pages.

Also fix up the corresponding comments.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 19 +++++++++---------
 mm/memcontrol.c            | 40 +++++++++++++++++++-------------------
 mm/slab.h                  |  4 ++--
 3 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 851c373edb74..c372bed6be80 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1362,12 +1362,11 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
+int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			unsigned int nr_pages);
+void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-			      unsigned int nr_pages);
-void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-				 unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1403,19 +1402,19 @@ static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 		__memcg_kmem_uncharge_page(page, order);
 }
 
-static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-					  unsigned int nr_pages)
+static inline int memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+				    unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
+		return __memcg_kmem_charge(memcg, gfp, nr_pages);
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-					     unsigned int nr_pages)
+static inline void memcg_kmem_uncharge(struct mem_cgroup *memcg,
+				       unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_memcg(memcg, nr_pages);
+		__memcg_kmem_uncharge(memcg, nr_pages);
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1561ef984104..8798702d165b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2819,15 +2819,15 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep)
 }
 
 /**
- * __memcg_kmem_charge_memcg: charge a kmem page
+ * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
  * @nr_pages: number of pages to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-			      unsigned int nr_pages)
+int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			unsigned int nr_pages)
 {
 	struct page_counter *counter;
 	int ret;
@@ -2854,6 +2854,21 @@ int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
 	return 0;
 }
 
+/**
+ * __memcg_kmem_uncharge: uncharge a number of kernel pages from a memcg
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		page_counter_uncharge(&memcg->kmem, nr_pages);
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&memcg->memsw, nr_pages);
+}
+
 /**
  * __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup
  * @page: page to charge
@@ -2872,7 +2887,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
+		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
@@ -2882,21 +2897,6 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	return ret;
 }
 
-/**
- * __memcg_kmem_uncharge_memcg: uncharge a kmem page
- * @memcg: memcg to uncharge
- * @nr_pages: number of pages to uncharge
- */
-void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-				 unsigned int nr_pages)
-{
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		page_counter_uncharge(&memcg->kmem, nr_pages);
-
-	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, nr_pages);
-}
 /**
  * __memcg_kmem_uncharge_page: uncharge a kmem page
  * @page: page to uncharge
@@ -2911,7 +2911,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	__memcg_kmem_uncharge_memcg(memcg, nr_pages);
+	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 
 	/* slab pages do not have PageKmemcg flag set */
diff --git a/mm/slab.h b/mm/slab.h
index a7ed8b422d8f..d943264f0f09 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,7 +366,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
+	ret = memcg_kmem_charge(memcg, gfp, nr_pages);
 	if (ret)
 		goto out;
 
@@ -397,7 +397,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
-		memcg_kmem_uncharge_memcg(memcg, nr_pages);
+		memcg_kmem_uncharge(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    -nr_pages);
-- 
2.24.1




* [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (5 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 06/28] mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 16:05   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations Roman Gushchin
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Sometimes we need to get a memcg pointer from a charged kernel object.
The right way to get it depends on whether it's a proper slab object
or whether it's backed by raw pages (e.g. it's a vmalloc allocation). In
the first case the kmem_cache->memcg_params.memcg indirection should be
used; in other cases it's just page->mem_cgroup.

To simplify this task and hide the implementation details let's
introduce a mem_cgroup_from_obj() helper, which takes a pointer
to any kernel object and returns a valid memcg pointer or NULL.

Passing a kernel address rather than a pointer to a page will allow
using this helper for per-object (rather than per-page) tracked
objects in the future.

The caller is still responsible for ensuring that the returned memcg
isn't going away underneath: by taking the rcu read lock, the cgroup
mutex, etc., depending on the context.
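
For example, a typical lookup under the RCU read lock could look like
this (illustrative usage, not part of the patch; p, idx and val stand
for a kernel address, a stat item and a delta):

	rcu_read_lock();
	memcg = mem_cgroup_from_obj(p);
	if (memcg)
		/* e.g. update a per-memcg statistic */
		mod_memcg_state(memcg, idx, val);
	rcu_read_unlock();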

mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete
and can be removed.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/list_lru.c              | 12 +-----------
 mm/memcontrol.c            | 32 +++++++++++++++++++++++++++++---
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c372bed6be80..24c50d004c46 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1427,6 +1427,8 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+struct mem_cgroup *mem_cgroup_from_obj(void *p);
+
 #else
 
 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
@@ -1470,6 +1472,11 @@ static inline void memcg_put_cache_ids(void)
 {
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+       return NULL;
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0f1f6b06b7f3..8de5e3784ee4 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -57,16 +57,6 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 	return &nlru->lru;
 }
 
-static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
-{
-	struct page *page;
-
-	if (!memcg_kmem_enabled())
-		return NULL;
-	page = virt_to_head_page(ptr);
-	return memcg_from_slab_page(page);
-}
-
 static inline struct list_lru_one *
 list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 		   struct mem_cgroup **memcg_ptr)
@@ -77,7 +67,7 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 	if (!nlru->memcg_lrus)
 		goto out;
 
-	memcg = mem_cgroup_from_kmem(ptr);
+	memcg = mem_cgroup_from_obj(ptr);
 	if (!memcg)
 		goto out;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8798702d165b..43010471621c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -757,13 +757,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
-	struct page *page = virt_to_head_page(p);
-	pg_data_t *pgdat = page_pgdat(page);
+	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	rcu_read_lock();
-	memcg = memcg_from_slab_page(page);
+	memcg = mem_cgroup_from_obj(p);
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg || memcg == root_mem_cgroup) {
@@ -2636,6 +2635,33 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+/*
+ * Returns a pointer to the memory cgroup to which the kernel object is charged.
+ *
+ * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(),
+ * cgroup_mutex, etc.
+ */
+struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+	struct page *page;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	page = virt_to_head_page(p);
+
+	/*
+	 * Slab pages don't have page->mem_cgroup set because corresponding
+	 * kmem caches can be reparented during the lifetime. That's why
+	 * memcg_from_slab_page() should be used instead.
+	 */
+	if (PageSlab(page))
+		return memcg_from_slab_page(page);
+
+	/* All other pages use page->mem_cgroup */
+	return page->mem_cgroup;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
-- 
2.24.1




* [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (6 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 16:12   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state() Roman Gushchin
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio
the space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() or kmem_cache_alloc_node(). In the first and the
second cases the page->mem_cgroup pointer is set, but in the third it's
not: memcg membership of a slab page should be determined using the
memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg. In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
the page->mem_cgroup pointer is NULL even for pages charged to a
non-root memory cgroup.

In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as the first argument, uses
mem_cgroup_from_obj() to get an RCU-protected memcg pointer and
calls mod_memcg_state(). It allows handling all possible
configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE
values) without spilling any memcg/kmem specifics into fork.c.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  5 +++++
 kernel/fork.c              |  4 ++--
 mm/memcontrol.c            | 11 +++++++++++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 24c50d004c46..1b4150ff64be 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -695,6 +695,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
+void mod_memcg_obj_state(void *p, int idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
@@ -1123,6 +1124,10 @@ static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
 	__mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline void mod_memcg_obj_state(void *p, int idx, int val)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dad271ee28e..d8d1ccc7a40e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -397,8 +397,8 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
 				    THREAD_SIZE / 1024 * account);
 
-		mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
-				     account * (THREAD_SIZE / 1024));
+		mod_memcg_obj_state(stack, MEMCG_KERNEL_STACK_KB,
+				    account * (THREAD_SIZE / 1024));
 	}
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 43010471621c..2bdc1ae5402a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -774,6 +774,17 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 	rcu_read_unlock();
 }
 
+void mod_memcg_obj_state(void *p, int idx, int val)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_obj(p);
+	if (memcg)
+		mod_memcg_state(memcg, idx, val);
+	rcu_read_unlock();
+}
+
 /**
  * __count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
-- 
2.24.1




* [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (7 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 16:13   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state() Roman Gushchin
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
to unify it with mod_memcg_obj_state(). It better reflects the fact
that the passed object isn't necessarily slab-backed.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 8 ++++----
 mm/memcontrol.c            | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b4150ff64be..37d4f418e336 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -694,7 +694,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
-void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
+void __mod_lruvec_obj_state(void *p, enum node_stat_item idx, int val);
 void mod_memcg_obj_state(void *p, int idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
@@ -1116,7 +1116,7 @@ static inline void mod_lruvec_page_state(struct page *page,
 	mod_node_page_state(page_pgdat(page), idx, val);
 }
 
-static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
+static inline void __mod_lruvec_obj_state(void *p, enum node_stat_item idx,
 					   int val)
 {
 	struct page *page = virt_to_head_page(p);
@@ -1217,12 +1217,12 @@ static inline void __dec_lruvec_page_state(struct page *page,
 
 static inline void __inc_lruvec_slab_state(void *p, enum node_stat_item idx)
 {
-	__mod_lruvec_slab_state(p, idx, 1);
+	__mod_lruvec_obj_state(p, idx, 1);
 }
 
 static inline void __dec_lruvec_slab_state(void *p, enum node_stat_item idx)
 {
-	__mod_lruvec_slab_state(p, idx, -1);
+	__mod_lruvec_obj_state(p, idx, -1);
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bdc1ae5402a..cbf01cc0cbac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -755,7 +755,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
-void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
+void __mod_lruvec_obj_state(void *p, enum node_stat_item idx, int val)
 {
 	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
 	struct mem_cgroup *memcg;
-- 
2.24.1




* [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (8 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 17:39   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

To prepare for per-object accounting of slab objects, let's introduce
__mod_lruvec_memcg_state() and mod_lruvec_memcg_state() helpers,
which are similar to mod_lruvec_state(), but do not update global
node counters, only lruvec and per-cgroup.

It's necessary because soon node slab counters will be used for
accounting of all memory used by slab pages; however, on the memcg level
only the actually used memory will be counted. The free space will be
shared between all cgroups, so it can't be accounted to any
specific cgroup.
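
For example (illustrative numbers): a 4 KB slab page hosting 30 objects
of 128 bytes each adds 4096 bytes to the node-level slab counter, while
the memcg-level counters account only the 3840 bytes actually occupied
by objects; the remaining 256 bytes of free space aren't charged to any
particular cgroup.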

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 22 ++++++++++++++++++++++
 mm/memcontrol.c            | 37 +++++++++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 37d4f418e336..73c2a7d32862 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -694,6 +694,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
+			      int val);
 void __mod_lruvec_obj_state(void *p, enum node_stat_item idx, int val);
 void mod_memcg_obj_state(void *p, int idx, int val);
 
@@ -707,6 +709,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	local_irq_restore(flags);
 }
 
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_lruvec_memcg_state(lruvec, idx, val);
+	local_irq_restore(flags);
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
@@ -1104,6 +1116,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 	mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
 }
 
+static inline void __mod_lruvec_memcg_state(struct lruvec *lruvec,
+					    enum node_stat_item idx, int val)
+{
+}
+
+static inline void mod_lruvec_memcg_state(struct lruvec *lruvec,
+					  enum node_stat_item idx, int val)
+{
+}
+
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cbf01cc0cbac..730f230cee6a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -712,16 +712,16 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
 }
 
 /**
- * __mod_lruvec_state - update lruvec memory statistics
+ * __mod_lruvec_memcg_state - update lruvec memory statistics
  * @lruvec: the lruvec
  * @idx: the stat item
  * @val: delta to add to the counter, can be negative
  *
  * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
+ * function updates the two of three counters that are affected by a
+ * change of state at this level: per-cgroup and per-lruvec.
  */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val)
 {
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
@@ -729,12 +729,6 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	struct mem_cgroup *memcg;
 	long x;
 
-	/* Update node */
-	__mod_node_page_state(pgdat, idx, val);
-
-	if (mem_cgroup_disabled())
-		return;
-
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
@@ -755,6 +749,29 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+			int val)
+{
+	pg_data_t *pgdat = lruvec_pgdat(lruvec);
+
+	/* Update node */
+	__mod_node_page_state(pgdat, idx, val);
+
+	/* Update per-cgroup and per-lruvec stats */
+	if (!mem_cgroup_disabled())
+		__mod_lruvec_memcg_state(lruvec, idx, val);
+}
+
 void __mod_lruvec_obj_state(void *p, enum node_stat_item idx, int val)
 {
 	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
-- 
2.24.1




* [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (9 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 17:44   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

This commit implements a SLUB version of the obj_to_index() function,
which will be required to calculate the offset of obj_cgroup in the
obj_cgroups vector to store/obtain the objcg ownership data.

To make it faster, let's repeat SLAB's trick introduced by
commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
divide in obj_to_index()") and avoid an expensive division.
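
For illustration only (not part of the patch; the example function name
is made up), the trick boils down to precomputing a reciprocal once per
cache and replacing the per-allocation division with a multiply and a
shift, using the helpers from <linux/reciprocal_div.h>:

#include <linux/reciprocal_div.h>

static inline unsigned int example_obj_index(unsigned int offset,
					     unsigned int obj_size)
{
	/* precomputed once per cache in the real code (see below) */
	struct reciprocal_value r = reciprocal_value(obj_size);

	/* equivalent to "offset / obj_size", but without a division */
	return reciprocal_divide(offset, r);
}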

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 include/linux/slub_def.h | 9 +++++++++
 mm/slub.c                | 1 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..200ea292f250 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -8,6 +8,7 @@
  * (C) 2007 SGI, Christoph Lameter
  */
 #include <linux/kobject.h>
+#include <linux/reciprocal_div.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -86,6 +87,7 @@ struct kmem_cache {
 	unsigned long min_partial;
 	unsigned int size;	/* The size of an object including metadata */
 	unsigned int object_size;/* The size of an object without metadata */
+	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
@@ -182,4 +184,11 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
 	return result;
 }
 
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct page *page, void *obj)
+{
+	return reciprocal_divide(kasan_reset_tag(obj) - page_address(page),
+				 cache->reciprocal_size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index 503e11b1c4e1..f2fe3a8e420a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3598,6 +3598,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	 */
 	size = ALIGN(size, s->align);
 	s->size = size;
+	s->reciprocal_size = reciprocal_value(size);
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (10 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 17:58   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 13/28] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Currently the s8 type is used for per-cpu caching of per-node statistics.
It works fine because the overfill threshold can't exceed 125.

But if some counters are in bytes (and the next commit in the series
will convert slab counters to bytes), it's not going to work: a value
in bytes can easily exceed the s8 range without exceeding the threshold
converted to bytes. So to avoid overfilling per-cpu caches and breaking
vmstats correctness, let's use s32 instead.
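
For example, with 4 KiB pages the maximum per-cpu threshold of 125
pages corresponds to 125 << PAGE_SHIFT = 512000 bytes once a counter is
maintained in bytes: far outside the s8 range of [-128, 127], but with
plenty of headroom in s32.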

This doesn't affect per-zone statistics. There are no plans to use
zone-level byte-sized counters, so there is no reason to change anything.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..1c5eafd925e7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -360,7 +360,7 @@ struct per_cpu_pageset {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..6242129939dd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -337,7 +337,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -395,13 +395,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -439,13 +439,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -538,7 +538,7 @@ static inline void mod_node_state(struct pglist_data *pgdat,
        enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	do {
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 13/28] mm: vmstat: convert slab vmstat counter to bytes
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (11 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

In order to prepare for per-object slab memory accounting,
convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat
items to bytes.

To make it obvious that these vmstats are in bytes, rename them
to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to
NR_KERNEL_STACK_KB).

The size of slab memory shouldn't exceed 4 GB on 32-bit machines,
so it will fit into the atomic_long_t we use for vmstats.
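
A minimal sketch of the resulting convention (not part of the patch; it
mirrors the node_page_state_pages()/global_node_page_state_pages()
helpers added below): the counters are maintained in bytes and shifted
back to pages only at the reporting boundary.

static unsigned long example_slab_reclaimable_pages(void)
{
	/* the vmstat item is now in bytes ... */
	unsigned long bytes = global_node_page_state(NR_SLAB_RECLAIMABLE_B);

	/* ... and converted to pages only where pages are expected */
	return bytes >> PAGE_SHIFT;
}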

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 drivers/base/node.c     | 14 +++++++++-----
 fs/proc/meminfo.c       |  4 ++--
 include/linux/mmzone.h  | 10 ++++++++--
 include/linux/vmstat.h  |  8 ++++++++
 kernel/power/snapshot.c |  2 +-
 mm/memcontrol.c         | 25 ++++++++++++++-----------
 mm/oom_kill.c           |  2 +-
 mm/page_alloc.c         |  8 ++++----
 mm/slab.h               | 15 ++++++++-------
 mm/slab_common.c        |  4 ++--
 mm/slob.c               | 12 ++++++------
 mm/slub.c               |  8 ++++----
 mm/vmscan.c             |  3 ++-
 mm/vmstat.c             | 21 +++++++++++++++++++--
 mm/workingset.c         |  6 ++++--
 15 files changed, 92 insertions(+), 50 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 98a31bafc8a2..fa07c5806dcd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -368,8 +368,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 	unsigned long sreclaimable, sunreclaimable;
 
 	si_meminfo_node(&i, nid);
-	sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE);
-	sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B);
+	sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B);
 	n = sprintf(buf,
 		       "Node %d MemTotal:       %8lu kB\n"
 		       "Node %d MemFree:        %8lu kB\n"
@@ -505,9 +505,13 @@ static ssize_t node_read_vmstat(struct device *dev,
 			     sum_zone_numa_state(nid, i));
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i),
-			     node_page_state(pgdat, i));
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		unsigned long x = node_page_state(pgdat, i);
+
+		if (vmstat_item_in_bytes(i))
+			x >>= PAGE_SHIFT;
+		n += sprintf(buf+n, "%s %lu\n", node_stat_name(i), x);
+	}
 
 	return n;
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..0811e4100084 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -53,8 +53,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	available = si_mem_available();
-	sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);
-	sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE);
+	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
+	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1c5eafd925e7..9935445786b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -215,8 +215,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,
-	NR_SLAB_UNRECLAIMABLE,
+	NR_SLAB_RECLAIMABLE_B,
+	NR_SLAB_UNRECLAIMABLE_B,
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
@@ -246,6 +246,12 @@ enum node_stat_item {
 	NR_VM_NODE_STAT_ITEMS
 };
 
+static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
+{
+	return (item == NR_SLAB_RECLAIMABLE_B ||
+		item == NR_SLAB_UNRECLAIMABLE_B);
+}
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f3d24d..5b0f61b3ca75 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -200,6 +200,12 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
 	return x;
 }
 
+static inline
+unsigned long global_node_page_state_pages(enum node_stat_item item)
+{
+	return global_node_page_state(item) >> PAGE_SHIFT;
+}
+
 static inline unsigned long zone_page_state(struct zone *zone,
 					enum zone_stat_item item)
 {
@@ -240,6 +246,8 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
+					   enum node_stat_item item);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index ddade80ad276..84cdeb12ac5c 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1664,7 +1664,7 @@ static unsigned long minimum_image_size(unsigned long saveable)
 {
 	unsigned long size;
 
-	size = global_node_page_state(NR_SLAB_RECLAIMABLE)
+	size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B)
 		+ global_node_page_state(NR_ACTIVE_ANON)
 		+ global_node_page_state(NR_INACTIVE_ANON)
 		+ global_node_page_state(NR_ACTIVE_FILE)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 730f230cee6a..bf846fb60d9f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -679,13 +679,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	if (mem_cgroup_disabled())
 		return;
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup *mi;
 
 		/*
@@ -727,7 +730,7 @@ void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pg_data_t *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
-	long x;
+	long x, threshold = MEMCG_CHARGE_BATCH;
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
@@ -738,8 +741,11 @@ void __mod_lruvec_memcg_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
+	if (vmstat_item_in_bytes(idx))
+		threshold <<= PAGE_SHIFT;
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > threshold)) {
 		struct mem_cgroup_per_node *pi;
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -1409,9 +1415,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
 		       1024);
 	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
-		       PAGE_SIZE);
+		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
 	seq_buf_printf(&s, "sock %llu\n",
 		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
 		       PAGE_SIZE);
@@ -1445,11 +1450,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 			       PAGE_SIZE);
 
 	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
-		       PAGE_SIZE);
+		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
 
 	/* Accumulated memory events */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7cc37e3f91d..93d92575245c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,7 +184,7 @@ static bool is_dump_unreclaim_slabs(void)
 		 global_node_page_state(NR_ISOLATED_FILE) +
 		 global_node_page_state(NR_UNEVICTABLE);
 
-	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
+	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f842ebcb4600..e1b54125279a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5108,8 +5108,8 @@ long si_mem_available(void)
 	 * items that are in use, and cannot be freed. Cap this estimate at the
 	 * low watermark.
 	 */
-	reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) +
-			global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
+	reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) +
+		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE);
 	available += reclaimable - min(reclaimable / 2, wmark_low);
 
 	if (available < 0)
@@ -5253,8 +5253,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
 		global_node_page_state(NR_UNSTABLE_NFS),
-		global_node_page_state(NR_SLAB_RECLAIMABLE),
-		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
+		global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B),
+		global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B),
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_zone_page_state(NR_PAGETABLE),
diff --git a/mm/slab.h b/mm/slab.h
index d943264f0f09..517f1f1359e5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -272,7 +272,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 static inline int cache_vmstat_idx(struct kmem_cache *s)
 {
 	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+		NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -361,7 +361,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages);
+				    nr_pages << PAGE_SHIFT);
 		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
 		return 0;
 	}
@@ -371,7 +371,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
 	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
@@ -396,11 +396,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
+				 -(nr_pages << PAGE_SHIFT));
 		memcg_kmem_uncharge(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -nr_pages);
+				    -(nr_pages << PAGE_SHIFT));
 	}
 	rcu_read_unlock();
 
@@ -484,7 +485,7 @@ static __always_inline int charge_slab_page(struct page *page,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    1 << order);
+				    PAGE_SIZE << order);
 		return 0;
 	}
 
@@ -496,7 +497,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 {
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -(PAGE_SIZE << order));
 		return;
 	}
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1907cb2903c7..a2afa4ff5d7b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1324,8 +1324,8 @@ void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 	page = alloc_pages(flags, order);
 	if (likely(page)) {
 		ret = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 	ret = kasan_kmalloc_large(ret, size, flags);
 	/* As ret might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index fa53e9f73893..8b7b56235438 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -202,8 +202,8 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
 	if (!page)
 		return NULL;
 
-	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-			    1 << order);
+	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+			    PAGE_SIZE << order);
 	return page_address(page);
 }
 
@@ -214,8 +214,8 @@ static void slob_free_pages(void *b, int order)
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += 1 << order;
 
-	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-			    -(1 << order));
+	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
 }
 
@@ -550,8 +550,8 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else {
 		unsigned int order = compound_order(sp);
-		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(sp, order);
 
 	}
diff --git a/mm/slub.c b/mm/slub.c
index f2fe3a8e420a..ed6aea234400 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3837,8 +3837,8 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	page = alloc_pages_node(node, flags, order);
 	if (page) {
 		ptr = page_address(page);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    1 << order);
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    PAGE_SIZE << order);
 	}
 
 	return kmalloc_large_node_hook(ptr, size, flags);
@@ -3969,8 +3969,8 @@ void kfree(const void *x)
 
 		BUG_ON(!PageCompound(page));
 		kfree_hook(object);
-		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
-				    -(1 << order));
+		mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B,
+				    -(PAGE_SIZE << order));
 		__free_pages(page, order);
 		return;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1863de475fb..ec6d89cdcb73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4251,7 +4251,8 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * unmapped file backed pages.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
+	    pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
 	/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6242129939dd..57dca4a5f0ea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -344,6 +344,8 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 	x = delta + __this_cpu_read(*p);
 
 	t = __this_cpu_read(pcp->stat_threshold);
+	if (vmstat_item_in_bytes(item))
+		t <<= PAGE_SHIFT;
 
 	if (unlikely(x > t || x < -t)) {
 		node_page_state_add(x, pgdat, item);
@@ -555,6 +557,8 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 		 * for all cpus in a node.
 		 */
 		t = this_cpu_read(pcp->stat_threshold);
+		if (vmstat_item_in_bytes(item))
+			t <<= PAGE_SHIFT;
 
 		o = this_cpu_read(*p);
 		n = delta + o;
@@ -999,6 +1003,12 @@ unsigned long node_page_state(struct pglist_data *pgdat,
 #endif
 	return x;
 }
+
+unsigned long node_page_state_pages(struct pglist_data *pgdat,
+				    enum node_stat_item item)
+{
+	return node_page_state(pgdat, item) >> PAGE_SHIFT;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1565,8 +1575,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	if (is_zone_first_populated(pgdat, zone)) {
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			unsigned long x = node_page_state(pgdat, i);
+
+			if (vmstat_item_in_bytes(i))
+				x >>= PAGE_SHIFT;
 			seq_printf(m, "\n      %-12s %lu", node_stat_name(i),
-				   node_page_state(pgdat, i));
+				   x);
 		}
 	}
 	seq_printf(m,
@@ -1686,8 +1700,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	v += NR_VM_NUMA_STAT_ITEMS;
 #endif
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 		v[i] = global_node_page_state(i);
+		if (vmstat_item_in_bytes(i))
+			v[i] >>= PAGE_SHIFT;
+	}
 	v += NR_VM_NODE_STAT_ITEMS;
 
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..9358c1ee5bb6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -467,8 +467,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE);
-		pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE);
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
+		pages += lruvec_page_state_local(
+			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (12 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 13/28] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API Roman Gushchin
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao

From: Johannes Weiner <hannes@cmpxchg.org>

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's
new slab controller, which maintains pools of objects and doesn't want
to keep an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track
pointers to an object) is only for historical reasons: memcg used to
not take any css references and simply stalled offlining until all
charges had been reparented and the page counters had dropped to
zero. When we got rid of the reparenting requirement, the simple
mechanical translation was to take a reference for every charge.

More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"),
commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
reclaim from offlined groups").

The new slab controller exposes the limitations in this scheme, so
let's switch it to a more idiomatic reference counting model based on
actual kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg it is caching

- User pages hold a reference for their page->mem_cgroup. Transparent
  huge pages will no longer acquire tail references in advance, we'll
  get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- mem_cgroup_try_charge(), if successful, will return one reference to
  be consumed by page->mem_cgroup during commit, or put during cancel

- Pages allocated in the root cgroup will acquire and release css
  references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
  references; this change gets rid of that as well.

Roman: I've reformatted commit references in the commit log to make
  checkpatch.pl happy.
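
To illustrate the new model (a simplified sketch, not taken from the
patch; the helper names are made up): a css reference now follows the
page->mem_cgroup pointer itself instead of the number of charged pages.

static void example_commit(struct page *page, struct mem_cgroup *memcg)
{
	css_get(&memcg->css);		/* one reference per pointer ... */
	page->mem_cgroup = memcg;
}

static void example_clear(struct page *page, struct mem_cgroup *memcg)
{
	page->mem_cgroup = NULL;	/* ... dropped when the pointer goes */
	css_put(&memcg->css);
}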

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 45 ++++++++++++++++++++++++++-------------------
 mm/slab.h       |  2 --
 2 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf846fb60d9f..b86cfdcf2e1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2109,13 +2109,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
+	if (!old)
+		return;
+
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
-		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
+
+	css_put(&old->css);
 	stock->cached = NULL;
 }
 
@@ -2151,6 +2155,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
+		css_get(&memcg->css);
 		stock->cached = memcg;
 	}
 	stock->nr_pages += nr_pages;
@@ -2554,12 +2559,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2596,8 +2599,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
@@ -2948,6 +2949,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+			return 0;
 		}
 	}
 	css_put(&memcg->css);
@@ -2970,12 +2972,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
+	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -2987,15 +2988,18 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = head->mem_cgroup;
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		css_get(&memcg->css);
+		head[i].mem_cgroup = memcg;
+	}
 
-	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
+	__mod_memcg_state(memcg, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -5401,7 +5405,9 @@ static int mem_cgroup_move_account(struct page *page,
 	 * uncharging, charging, migration, or LRU putback.
 	 */
 
-	/* caller should have done css_get */
+	css_get(&to->css);
+	css_put(&from->css);
+
 	page->mem_cgroup = to;
 
 	spin_unlock_irqrestore(&from->move_lock, flags);
@@ -6420,8 +6426,10 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		memcg = get_mem_cgroup_from_mm(mm);
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
-
-	css_put(&memcg->css);
+	if (ret) {
+		css_put(&memcg->css);
+		memcg = NULL;
+	}
 out:
 	*memcgp = memcg;
 	return ret;
@@ -6517,6 +6525,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		return;
 
 	cancel_charge(memcg, nr_pages);
+
+	css_put(&memcg->css);
 }
 
 struct uncharge_gather {
@@ -6558,9 +6568,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
-
-	if (!mem_cgroup_is_root(ug->memcg))
-		css_put_many(&ug->memcg->css, nr_pages);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6608,6 +6615,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->dummy_page = page;
 	page->mem_cgroup = NULL;
+	css_put(&ug->memcg->css);
 }
 
 static void uncharge_list(struct list_head *page_list)
@@ -6714,8 +6722,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
+	css_get(&memcg->css);
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
@@ -6964,8 +6972,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 				     -nr_entries);
 	memcg_check_events(memcg, page);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_entries);
+	css_put(&memcg->css);
 }
 
 /**
diff --git a/mm/slab.h b/mm/slab.h
index 517f1f1359e5..7925f7005161 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -373,9 +373,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
-	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-	css_put_many(&memcg->css, 1 << order);
 out:
 	css_put(&memcg->css);
 	return ret;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (13 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 19:31   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

The obj_cgroup API provides the ability to account sub-page sized kernel
objects, which potentially outlive the original memory cgroup.

The top-level API consists of the following functions:
  bool obj_cgroup_tryget(struct obj_cgroup *objcg);
  void obj_cgroup_get(struct obj_cgroup *objcg);
  void obj_cgroup_put(struct obj_cgroup *objcg);

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

  struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);

An object cgroup is basically a pointer to a memory cgroup with a per-cpu
reference counter. It substitutes a memory cgroup in places where it's
necessary to charge a custom number of bytes instead of pages.

All charged memory rounded down to pages is charged to the
corresponding memory cgroup using __memcg_kmem_charge().

It implements reparenting: on memcg offlining it is reattached to the
parent memory cgroup. Each online memory cgroup has an associated
active object cgroup to handle new allocations and a list of all
attached object cgroups. When a cgroup is offlined, this list is
reparented and for each object cgroup in the list the memcg pointer is
swapped to the parent memory cgroup. This prevents long-living objects
from pinning the original memory cgroup in memory.

The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is part of the
obj_cgroup object, so on cgroup offlining the leftover is automatically
reparented.
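
For example (assuming 4 KiB pages), charging a 700-byte object charges
one full page to the memory cgroup via __memcg_kmem_charge() and puts
the remaining PAGE_SIZE - 700 bytes into the per-cpu stock, so
subsequent sub-page charges can be served without touching the page
counters.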

memcg->objcg is RCU-protected.
objcg->memcg is a raw pointer, which always points at a memory cgroup,
but can be atomically swapped to the parent memory cgroup. So the
caller must ensure the lifetime of the memory cgroup, e.g. by grabbing
rcu_read_lock or css_set_lock.
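
For orientation, a minimal sketch of the expected usage (not part of
the patch; the function name and error codes are made up), assuming the
caller already found the memcg it wants to charge:

static int example_charge_object(struct mem_cgroup *memcg, size_t size,
				 struct obj_cgroup **objcgp)
{
	struct obj_cgroup *objcg;

	rcu_read_lock();
	objcg = rcu_dereference(memcg->objcg);
	if (!objcg || !obj_cgroup_tryget(objcg)) {
		rcu_read_unlock();
		return -ENOENT;
	}
	rcu_read_unlock();

	if (obj_cgroup_charge(objcg, GFP_KERNEL, size)) {
		obj_cgroup_put(objcg);
		return -ENOMEM;
	}

	/*
	 * The caller stores objcg next to the object and later undoes
	 * the charge with obj_cgroup_uncharge(objcg, size) followed by
	 * obj_cgroup_put(objcg).
	 */
	*objcgp = objcg;
	return 0;
}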

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  49 ++++++++
 mm/memcontrol.c            | 222 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 269 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73c2a7d32862..30bbea3f85e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
+ */
+struct obj_cgroup {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	atomic_t nr_charged_bytes;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -306,6 +323,8 @@ struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct obj_cgroup __rcu *objcg;
+	struct list_head objcg_list;
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -429,6 +448,33 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
+{
+	return percpu_ref_tryget(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	percpu_ref_get(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+	percpu_ref_put(&objcg->refcnt);
+}
+
+/*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1395,6 +1441,9 @@ void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b86cfdcf2e1d..9aa37bc61db5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,73 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+	unsigned int nr_bytes;
+	unsigned int nr_pages;
+	unsigned long flags;
+
+	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
+	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+
+	if (nr_pages) {
+		rcu_read_lock();
+		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+		rcu_read_unlock();
+	}
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	list_del(&objcg->list);
+	mem_cgroup_put(obj_cgroup_memcg(objcg));
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	percpu_ref_exit(ref);
+	kfree_rcu(objcg, rcu);
+}
+
+static struct obj_cgroup *obj_cgroup_alloc(void)
+{
+	struct obj_cgroup *objcg;
+	int ret;
+
+	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
+		return NULL;
+
+	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
+			      GFP_KERNEL);
+	if (ret) {
+		kfree(objcg);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&objcg->list);
+	return objcg;
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct obj_cgroup *objcg;
+
+	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+	/* Paired with mem_cgroup_put() in obj_cgroup_release(). */
+	css_get(&memcg->css);
+	percpu_ref_kill(&objcg->refcnt);
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(objcg, &memcg->objcg_list, list) {
+		css_get(&parent->css);
+		xchg(&objcg->memcg, parent);
+		css_put(&memcg->css);
+	}
+	list_splice(&memcg->objcg_list, &parent->objcg_list);
+	spin_unlock_irq(&css_set_lock);
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -2062,6 +2129,12 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	unsigned int nr_bytes;
+#endif
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2069,6 +2142,22 @@ struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg);
+
+#else
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+}
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	return false;
+}
+#endif
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2135,6 +2224,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2194,6 +2284,8 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		if (obj_stock_flush_required(stock, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2978,6 +3070,120 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 }
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->cached_objcg;
+
+	if (!old)
+		return;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		if (nr_pages) {
+			rcu_read_lock();
+			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			rcu_read_unlock();
+		}
+
+		atomic_add(nr_bytes, &old->nr_charged_bytes);
+		stock->nr_bytes = 0;
+	}
+
+	obj_cgroup_put(old);
+	stock->cached_objcg = NULL;
+}
+
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+
+	return false;
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached_objcg != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		obj_cgroup_get(objcg);
+		stock->cached_objcg = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > PAGE_SIZE)
+		drain_obj_stock(stock);
+
+	local_irq_restore(flags);
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	unsigned int nr_pages, nr_bytes;
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	css_get(&memcg->css);
+	rcu_read_unlock();
+
+	nr_pages = size >> PAGE_SHIFT;
+	nr_bytes = size & (PAGE_SIZE - 1);
+
+	if (nr_bytes)
+		nr_pages += 1;
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	if (!ret && nr_bytes)
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+
+	css_put(&memcg->css);
+	return ret;
+}
+
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
+{
+	refill_obj_stock(objcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -3400,7 +3606,8 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	int memcg_id;
+	struct obj_cgroup *objcg;
+	int memcg_id;
 
 	if (cgroup_memory_nokmem)
 		return 0;
@@ -3412,6 +3619,15 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg) {
+		memcg_free_cache_id(memcg_id);
+		return -ENOMEM;
+	}
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+	list_add(&objcg->list, &memcg->objcg_list);
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3447,9 +3663,10 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches.
+	 * Deactivate and reparent kmem_caches and objcgs.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5003,6 +5220,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->objcg_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (14 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 18:27   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Allocate and release memory to store obj_cgroup pointers for each
non-root slab page. Reuse the page->mem_cgroup pointer to store a
pointer to the allocated space.

To distinguish between obj_cgroups and memcg pointers in cases where
it's not obvious which one is used (as in page_cgroup_ino()),
let's always set the lowest bit in the obj_cgroup case.
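
A minimal sketch of the tagging convention (the helper names are made
up; the real accessors are added to mm/slab.h below): writers set the
lowest bit when storing the vector, readers test and mask it off.

static inline bool example_page_has_obj_cgroups(struct page *page)
{
	/* the lowest bit set means the field holds an obj_cgroups vector */
	return (unsigned long)page->obj_cgroups & 0x1UL;
}

static inline struct obj_cgroup **example_page_obj_cgroups(struct page *page)
{
	/* mask the tag bit off to get the actual vector address */
	return (struct obj_cgroup **)
		((unsigned long)page->obj_cgroups & ~0x1UL);
}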

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mm.h       | 25 ++++++++++++++++++--
 include/linux/mm_types.h |  5 +++-
 mm/memcontrol.c          |  5 ++--
 mm/slab.c                |  3 ++-
 mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
 mm/slub.c                |  2 +-
 6 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 080f8ac8bfb7..65224becc4ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 #ifdef CONFIG_MEMCG
 static inline struct mem_cgroup *page_memcg(struct page *page)
 {
-	return page->mem_cgroup;
+	struct mem_cgroup *memcg = page->mem_cgroup;
+
+	/*
+	 * The lowest bit set means that memcg isn't a valid memcg pointer,
+	 * but an obj_cgroups pointer. In this case the page is shared and
+	 * isn't charged to any specific memory cgroup. Return NULL.
+	 */
+	if ((unsigned long) memcg & 0x1UL)
+		memcg = NULL;
+
+	return memcg;
 }
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
+	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
+
 	WARN_ON_ONCE(!rcu_read_lock_held());
-	return READ_ONCE(page->mem_cgroup);
+
+	/*
+	 * The lowest bit set means that memcg isn't a valid memcg pointer,
+	 * but an obj_cgroups pointer. In this case the page is shared and
+	 * isn't charged to any specific memory cgroup. Return NULL.
+	 */
+	if ((unsigned long) memcg & 0x1UL)
+		memcg = NULL;
+
+	return memcg;
 }
 #else
 static inline struct mem_cgroup *page_memcg(struct page *page)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..5102f00f3336 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,10 @@ struct page {
 	atomic_t _refcount;
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	union {
+		struct mem_cgroup *mem_cgroup;
+		struct obj_cgroup **obj_cgroups;
+	};
 #endif
 
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9aa37bc61db5..94337ab1ebe9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -545,7 +545,8 @@ ino_t page_cgroup_ino(struct page *page)
 	if (PageSlab(page) && !PageTail(page))
 		memcg = memcg_from_slab_page(page);
 	else
-		memcg = READ_ONCE(page->mem_cgroup);
+		memcg = page_memcg_rcu(page);
+
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
@@ -2783,7 +2784,7 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 		return memcg_from_slab_page(page);
 
 	/* All other pages use page->mem_cgroup */
-	return page->mem_cgroup;
+	return page_memcg(page);
 }
 
 static int memcg_alloc_cache_id(void)
diff --git a/mm/slab.c b/mm/slab.c
index a89633603b2d..22e161b57367 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1370,7 +1370,8 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep,
+			     cachep->num)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 7925f7005161..8ee8c3a250ac 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
+{
+	/*
+	 * page->mem_cgroup and page->obj_cgroups are sharing the same
+	 * space. To distinguish between them in case we don't know for sure
+	 * that the page is a slab page (e.g. page_cgroup_ino()), let's
+	 * always set the lowest bit of obj_cgroups.
+	 */
+	return (struct obj_cgroup **)
+		((unsigned long)page->obj_cgroups & ~0x1UL);
+}
+
 /*
  * Expects a pointer to a slab page. Please note, that PageSlab() check
  * isn't sufficient, as it returns true also for tail compound slab pages,
@@ -406,6 +418,25 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					     unsigned int objects)
+{
+	void *vec;
+
+	vec = kcalloc(objects, sizeof(struct obj_cgroup *), gfp);
+	if (!vec)
+		return -ENOMEM;
+
+	page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 0x1UL);
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+	kfree(page_obj_cgroups(page));
+	page->obj_cgroups = NULL;
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -455,6 +486,16 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 {
 }
 
+static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
+					       unsigned int objects)
+{
+	return 0;
+}
+
+static inline void memcg_free_page_obj_cgroups(struct page *page)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -479,14 +520,21 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 
 static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
-					    struct kmem_cache *s)
+					    struct kmem_cache *s,
+					    unsigned int objects)
 {
+	int ret;
+
 	if (is_root_cache(s)) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    PAGE_SIZE << order);
 		return 0;
 	}
 
+	ret = memcg_alloc_page_obj_cgroups(page, gfp, objects);
+	if (ret)
+		return ret;
+
 	return memcg_charge_slab(page, gfp, order, s);
 }
 
@@ -499,6 +547,7 @@ static __always_inline void uncharge_slab_page(struct page *page, int order,
 		return;
 	}
 
+	memcg_free_page_obj_cgroups(page);
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index ed6aea234400..165e43076c8b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1516,7 +1516,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && charge_slab_page(page, flags, order, s)) {
+	if (page && charge_slab_page(page, flags, order, s, oo_objects(oo))) {
 		__free_pages(page, order);
 		page = NULL;
 	}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (15 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 19:53   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 18/28] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Store the obj_cgroup pointer in the corresponding place of
page->obj_cgroups for each allocated non-root slab object.
Make sure that each allocated object holds a reference to obj_cgroup.

The objcg pointer is obtained by dereferencing memcg->objcg in
memcg_kmem_get_cache() and is passed from pre_alloc_hook to
post_alloc_hook. Then, in case of successful allocation(s), it is
stored in the page->obj_cgroups vector.

The objcg-obtaining part looks a bit bulky now, but it will be
simplified by the next commits in the series.
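
For orientation, a simplified sketch (not the exact hunks below; the
function name is made up) of where the objcg pointer ends up for every
allocated object, using obj_to_index() and page_obj_cgroups() from the
previous patches:

static inline void example_store_objcg(struct kmem_cache *s,
				       struct page *page, void *obj,
				       struct obj_cgroup *objcg)
{
	unsigned int idx = obj_to_index(s, page, obj);

	/* each live object pins its obj_cgroup ... */
	obj_cgroup_get(objcg);
	/* ... and records it in the per-page vector */
	page_obj_cgroups(page)[idx] = objcg;
}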

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            | 14 +++++++--
 mm/slab.c                  | 18 +++++++-----
 mm/slab.h                  | 60 ++++++++++++++++++++++++++++++++++----
 mm/slub.c                  | 14 +++++----
 5 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 30bbea3f85e2..54bfb26b5016 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1431,7 +1431,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp);
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94337ab1ebe9..0e9fe272e688 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2896,7 +2896,8 @@ static inline bool memcg_kmem_bypass(void)
  * done with it, memcg_kmem_put_cache() must be called to release the
  * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
+					struct obj_cgroup **objcgp)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
@@ -2952,8 +2953,17 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 */
 	if (unlikely(!memcg_cachep))
 		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
+		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
+
+		if (!objcg || !obj_cgroup_tryget(objcg)) {
+			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
+			goto out_unlock;
+		}
+
+		*objcgp = objcg;
 		cachep = memcg_cachep;
+	}
 out_unlock:
 	rcu_read_unlock();
 	return cachep;
diff --git a/mm/slab.c b/mm/slab.c
index 22e161b57367..f16e896d5a3d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3223,9 +3223,10 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3261,7 +3262,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr)
 		memset(ptr, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &ptr);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr);
 	return ptr;
 }
 
@@ -3302,9 +3303,10 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
+	struct obj_cgroup *objcg = NULL;
 
 	flags &= gfp_allowed_mask;
-	cachep = slab_pre_alloc_hook(cachep, flags);
+	cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags);
 	if (unlikely(!cachep))
 		return NULL;
 
@@ -3318,7 +3320,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp)
 		memset(objp, 0, cachep->object_size);
 
-	slab_post_alloc_hook(cachep, flags, 1, &objp);
+	slab_post_alloc_hook(cachep, objcg, flags, 1, &objp);
 	return objp;
 }
 
@@ -3440,6 +3442,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
+	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
@@ -3505,8 +3508,9 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
 	size_t i;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (!s)
 		return 0;
 
@@ -3529,13 +3533,13 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		for (i = 0; i < size; i++)
 			memset(p[i], 0, s->object_size);
 
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
 	local_irq_enable();
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
diff --git a/mm/slab.h b/mm/slab.h
index 8ee8c3a250ac..0fdbeaf4aa8c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -437,6 +437,41 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+	struct page *page;
+	unsigned long off;
+	size_t i;
+
+	for (i = 0; i < size; i++) {
+		if (likely(p[i])) {
+			page = virt_to_head_page(p[i]);
+			off = obj_to_index(s, page, p[i]);
+			obj_cgroup_get(objcg);
+			page_obj_cgroups(page)[off] = objcg;
+		}
+	}
+	obj_cgroup_put(objcg);
+	memcg_kmem_put_cache(s);
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+	struct obj_cgroup *objcg;
+	unsigned int off;
+
+	if (!memcg_kmem_enabled() || is_root_cache(s))
+		return;
+
+	off = obj_to_index(s, page, p);
+	objcg = page_obj_cgroups(page)[off];
+	page_obj_cgroups(page)[off] = NULL;
+	obj_cgroup_put(objcg);
+}
+
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 
@@ -496,6 +531,17 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
+					      struct obj_cgroup *objcg,
+					      size_t size, void **p)
+{
+}
+
+static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
+					void *p)
+{
+}
+
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
@@ -605,7 +651,8 @@ static inline size_t slab_ksize(const struct kmem_cache *s)
 }
 
 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
-						     gfp_t flags)
+						     struct obj_cgroup **objcgp,
+						     size_t size, gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
 
@@ -619,13 +666,14 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s);
+		return memcg_kmem_get_cache(s, objcgp);
 
 	return s;
 }
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-					size_t size, void **p)
+static inline void slab_post_alloc_hook(struct kmem_cache *s,
+					struct obj_cgroup *objcg,
+					gfp_t flags, size_t size, void **p)
 {
 	size_t i;
 
@@ -637,8 +685,8 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 					 s->flags, flags);
 	}
 
-	if (memcg_kmem_enabled())
-		memcg_kmem_put_cache(s);
+	if (!is_root_cache(s))
+		memcg_slab_post_alloc_hook(s, objcg, size, p);
 }
 
 #ifndef CONFIG_SLOB
diff --git a/mm/slub.c b/mm/slub.c
index 165e43076c8b..6365e89cd503 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2698,8 +2698,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
+	struct obj_cgroup *objcg = NULL;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
+	s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
 	if (!s)
 		return NULL;
 redo:
@@ -2775,7 +2776,7 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
 		memset(object, 0, s->object_size);
 
-	slab_post_alloc_hook(s, gfpflags, 1, &object);
+	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object);
 
 	return object;
 }
@@ -2980,6 +2981,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
+
+	memcg_slab_free_hook(s, page, head);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3157,9 +3160,10 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 {
 	struct kmem_cache_cpu *c;
 	int i;
+	struct obj_cgroup *objcg = NULL;
 
 	/* memcg and kmem_cache debug support */
-	s = slab_pre_alloc_hook(s, flags);
+	s = slab_pre_alloc_hook(s, &objcg, size, flags);
 	if (unlikely(!s))
 		return false;
 	/*
@@ -3204,11 +3208,11 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	/* memcg and kmem_cache debug support */
-	slab_post_alloc_hook(s, flags, size, p);
+	slab_post_alloc_hook(s, objcg, flags, size, p);
 	return i;
 error:
 	local_irq_enable();
-	slab_post_alloc_hook(s, flags, i, p);
+	slab_post_alloc_hook(s, objcg, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
 }
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 18/28] mm: memcg/slab: charge individual slab objects instead of pages
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (16 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 19/28] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Switch to per-object accounting of non-root slab objects.

Charging is performed using the obj_cgroup API in the pre_alloc hook.
The obj_cgroup is charged with the size of the object plus the size of
the per-object metadata, which is currently the size of an obj_cgroup
pointer. If the memory has been charged successfully, the actual
allocation code is executed. Otherwise, -ENOMEM is returned.

In the post_alloc hook, if the actual allocation succeeded, the
corresponding vmstats are bumped and the obj_cgroup pointer is saved.
Otherwise, the charge is canceled.

On the free path the obj_cgroup pointer is obtained and used to uncharge
the size of the object being released.

Memcg and lruvec counters now represent only the memory used by active
slab objects and do not include the free space. The free space is shared
and doesn't belong to any specific cgroup.

Global per-node slab vmstats are still modified from the
(un)charge_slab_page() functions. The idea is to keep all slab pages
accounted as slab pages at the system level.
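
To make the arithmetic concrete, here is a worked example (a sketch
assuming a 64-bit kernel, i.e. an 8-byte obj_cgroup pointer, and a
kmalloc-192 cache; the numbers are purely illustrative):

    /*
     * obj_full_size(s) = s->size + sizeof(struct obj_cgroup *)
     *                  = 192 + 8 = 200 bytes per accounted object
     *
     * A bulk allocation of 16 such objects charges the obj_cgroup with
     * 16 * 200 = 3200 bytes in the pre_alloc hook, and each object
     * later uncharges 200 bytes on the free path:
     */
    obj_cgroup_charge(objcg, flags, 16 * obj_full_size(s));   /* alloc */
    obj_cgroup_uncharge(objcg, obj_full_size(s));             /* free  */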

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.h | 171 ++++++++++++++++++++++++------------------------------
 1 file changed, 75 insertions(+), 96 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 0fdbeaf4aa8c..6585638e5be0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -352,72 +352,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-/*
- * Charge the slab page belonging to the non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline int memcg_charge_slab(struct page *page,
-					     gfp_t gfp, int order,
-					     struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-	int ret;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	while (memcg && !css_tryget_online(&memcg->css))
-		memcg = parent_mem_cgroup(memcg);
-	rcu_read_unlock();
-
-	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    nr_pages << PAGE_SHIFT);
-		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
-		return 0;
-	}
-
-	ret = memcg_kmem_charge(memcg, gfp, nr_pages);
-	if (ret)
-		goto out;
-
-	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
-
-	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-out:
-	css_put(&memcg->css);
-	return ret;
-}
-
-/*
- * Uncharge a slab page belonging to a non-root kmem_cache.
- * Can be called for non-root kmem_caches only.
- */
-static __always_inline void memcg_uncharge_slab(struct page *page, int order,
-						struct kmem_cache *s)
-{
-	unsigned int nr_pages = 1 << order;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = READ_ONCE(s->memcg_params.memcg);
-	if (likely(!mem_cgroup_is_root(memcg))) {
-		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s),
-				 -(nr_pages << PAGE_SHIFT));
-		memcg_kmem_uncharge(memcg, nr_pages);
-	} else {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(nr_pages << PAGE_SHIFT));
-	}
-	rcu_read_unlock();
-
-	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
 					     unsigned int objects)
 {
@@ -437,6 +371,45 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 	page->obj_cgroups = NULL;
 }
 
+static inline size_t obj_full_size(struct kmem_cache *s)
+{
+	/*
+	 * For each accounted object there is an extra space which is used
+	 * to store obj_cgroup membership. Charge it too.
+	 */
+	return s->size + sizeof(struct obj_cgroup *);
+}
+
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	struct kmem_cache *cachep;
+
+	cachep = memcg_kmem_get_cache(s, objcgp);
+	if (is_root_cache(cachep))
+		return s;
+
+	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
+		memcg_kmem_put_cache(cachep);
+		cachep = NULL;
+	}
+
+	return cachep;
+}
+
+static inline void mod_objcg_memcg_state(struct obj_cgroup *objcg,
+					 struct pglist_data *pgdat,
+					 int idx, int nr)
+{
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_lruvec(obj_cgroup_memcg(objcg), pgdat);
+	mod_lruvec_memcg_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -451,6 +424,10 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 			off = obj_to_index(s, page, p[i]);
 			obj_cgroup_get(objcg);
 			page_obj_cgroups(page)[off] = objcg;
+			mod_objcg_memcg_state(objcg, page_pgdat(page),
+					      cache_vmstat_idx(s), s->size);
+		} else {
+			obj_cgroup_uncharge(objcg, obj_full_size(s));
 		}
 	}
 	obj_cgroup_put(objcg);
@@ -469,6 +446,11 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 	off = obj_to_index(s, page, p);
 	objcg = page_obj_cgroups(page)[off];
 	page_obj_cgroups(page)[off] = NULL;
+
+	obj_cgroup_uncharge(objcg, obj_full_size(s));
+	mod_objcg_memcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+			      -s->size);
+
 	obj_cgroup_put(objcg);
 }
 
@@ -510,17 +492,6 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 	return NULL;
 }
 
-static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
-				    struct kmem_cache *s)
-{
-	return 0;
-}
-
-static inline void memcg_uncharge_slab(struct page *page, int order,
-				       struct kmem_cache *s)
-{
-}
-
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
 					       unsigned int objects)
 {
@@ -531,6 +502,13 @@ static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
 }
 
+static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
+						struct obj_cgroup **objcgp,
+						size_t objects, gfp_t flags)
+{
+	return NULL;
+}
+
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 					      struct obj_cgroup *objcg,
 					      size_t size, void **p)
@@ -569,32 +547,33 @@ static __always_inline int charge_slab_page(struct page *page,
 					    struct kmem_cache *s,
 					    unsigned int objects)
 {
-	int ret;
-
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    PAGE_SIZE << order);
-		return 0;
-	}
+#ifdef CONFIG_MEMCG_KMEM
+	if (!is_root_cache(s)) {
+		int ret;
 
-	ret = memcg_alloc_page_obj_cgroups(page, gfp, objects);
-	if (ret)
-		return ret;
+		ret = memcg_alloc_page_obj_cgroups(page, gfp, objects);
+		if (ret)
+			return ret;
 
-	return memcg_charge_slab(page, gfp, order, s);
+		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	}
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    PAGE_SIZE << order);
+	return 0;
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
-		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(PAGE_SIZE << order));
-		return;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!is_root_cache(s)) {
+		memcg_free_page_obj_cgroups(page);
+		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 	}
-
-	memcg_free_page_obj_cgroups(page);
-	memcg_uncharge_slab(page, order, s);
+#endif
+	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+			    -(PAGE_SIZE << order));
 }
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
@@ -666,7 +645,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 
 	if (memcg_kmem_enabled() &&
 	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
-		return memcg_kmem_get_cache(s, objcgp);
+		return memcg_slab_pre_alloc_hook(s, objcgp, size, flags);
 
 	return s;
 }
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 19/28] mm: memcg/slab: deprecate memory.kmem.slabinfo
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (17 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 18/28] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 20/28] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Deprecate memory.kmem.slabinfo.

An empty file will be presented if the corresponding config options
are enabled.

The interface is implementation dependent, isn't present in cgroup v2,
and is generally useful only for core mm debugging purposes. In other
words, it doesn't provide any value for the absolute majority of users.

A drgn-based replacement can be found in tools/cgroup/slabinfo.py.
It supports both cgroup v1 and v2, mimics the memory.kmem.slabinfo
output and also makes it possible to obtain any additional information
without the need to recompile the kernel.

If the drgn-based solution is too slow for a task, a bpf-based tracing
tool can be used instead; it can easily keep track of all slab
allocations belonging to a memory cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c  |  3 ---
 mm/slab_common.c | 31 ++++---------------------------
 2 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0e9fe272e688..45bd9b1d9735 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5013,9 +5013,6 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = memcg_slab_start,
-		.seq_next = memcg_slab_next,
-		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a2afa4ff5d7b..80b4efdb1df3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1522,35 +1522,12 @@ void dump_unreclaimable_slab(void)
 }
 
 #if defined(CONFIG_MEMCG)
-void *memcg_slab_start(struct seq_file *m, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	mutex_lock(&slab_mutex);
-	return seq_list_start(&memcg->kmem_caches, *pos);
-}
-
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	return seq_list_next(p, &memcg->kmem_caches, pos);
-}
-
-void memcg_slab_stop(struct seq_file *m, void *p)
-{
-	mutex_unlock(&slab_mutex);
-}
-
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache,
-					  memcg_params.kmem_caches_node);
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	if (p == memcg->kmem_caches.next)
-		print_slabinfo_header(m);
-	cache_show(s, m);
+	/*
+	 * Deprecated.
+	 * Please, take a look at tools/cgroup/slabinfo.py .
+	 */
 	return 0;
 }
 #endif
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 20/28] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (18 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 19/28] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Roman Gushchin
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function is small
and fits nicely as a static inline.

It will be used from the slab code.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 7 +++++++
 mm/memcontrol.c            | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 54bfb26b5016..f578d1a24280 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1465,6 +1465,13 @@ static inline bool memcg_kmem_enabled(void)
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
+static inline bool memcg_kmem_bypass(void)
+{
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+	return false;
+}
+
 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
 					 int order)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45bd9b1d9735..ffe7e1e9f3c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2873,13 +2873,6 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	queue_work(memcg_kmem_cache_wq, &cw->work);
 }
 
-static inline bool memcg_kmem_bypass(void)
-{
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
-		return true;
-	return false;
-}
-
 /**
  * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (19 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 20/28] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-02-03 19:50   ` Johannes Weiner
  2020-01-27 17:34 ` [PATCH v2 22/28] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

This is a fairly big but mostly red patch, which makes all non-root
slab allocations use a single set of kmem_caches instead of
creating a separate set for each memory cgroup.

Because the number of non-root kmem_caches is now capped by the number
of root kmem_caches, there is no need to shrink or destroy them
prematurely. They can simply be destroyed together with their
root counterparts. This allows us to dramatically simplify the
management of non-root kmem_caches and to delete a ton of code.

This patch performs the following changes:
1) introduces the memcg_params.memcg_cache pointer to represent the
   kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
   to create the memcg kmem_cache on the first allocation attempt
3) names memcg kmem_caches <kmemcache_name>-memcg,
   e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
   or schedule its creation and return the root cache (see the sketch below)
5) removes almost all non-root kmem_cache management code
   (separate refcounter, reparenting, shrinking, etc.)
6) makes the slab debugfs code display the root_mem_cgroup css id and
   never show the :dead and :deact flags in the memcg_slabinfo attribute.

The following patches in the series will simplify kmem_cache creation further.
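
Taken together, the new pre-allocation path (points 1, 2 and 4 above)
conceptually boils down to the sketch below. It is trimmed from the
memcg_slab_pre_alloc_hook() changes in the mm/slab.h hunk further down,
so treat it as an illustration rather than the exact code:

    if (memcg_kmem_bypass())
            return s;                       /* no accounting needed */

    cachep = memcg_kmem_get_cache(s);       /* shared "<name>-memcg" clone */
    if (is_root_cache(cachep))
            return s;                       /* clone not created yet, use root */

    objcg = get_obj_cgroup_from_current();  /* the cgroup to charge */
    if (!objcg)
            return s;

    if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
            obj_cgroup_put(objcg);
            return NULL;                    /* charge failed: -ENOMEM */
    }

    *objcgp = objcg;
    return cachep;                          /* allocate from the shared clone */

The cache selection no longer depends on which cgroup is allocating:
the obj_cgroup carries the ownership information, and a single "-memcg"
clone per root cache serves all memory cgroups.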

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   7 +-
 include/linux/slab.h       |   5 +-
 mm/memcontrol.c            | 172 +++++----------
 mm/slab.c                  |  16 +-
 mm/slab.h                  | 145 ++++---------
 mm/slab_common.c           | 426 ++++---------------------------------
 mm/slub.c                  |  38 +---
 7 files changed, 146 insertions(+), 663 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f578d1a24280..95d66f46493c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -322,7 +322,6 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct list_head kmem_caches;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list;
 #endif
@@ -1431,9 +1430,7 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp);
-void memcg_kmem_put_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
@@ -1442,6 +1439,8 @@ void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_current(void);
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 03a389358562..b2dde3f24cfa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
+void memcg_create_kmem_cache(struct kmem_cache *cachep);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -578,8 +577,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 	return __kmalloc_node(size, flags, node);
 }
 
-int memcg_update_all_caches(int num_memcgs);
-
 /**
  * kmalloc_array - allocate memory for an array.
  * @n: number of elements.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffe7e1e9f3c0..3f92b1c71aed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -325,7 +325,7 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 }
 
 /*
- * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
+ * This will be used as a shrinker list's index.
  * The main reason for not using cgroup id for this:
  *  this works better in sparse environments, where we have a lot of memcgs,
  *  but only a few kmem-limited. Or also, if we have, for instance, 200
@@ -542,11 +542,7 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page))
-		memcg = memcg_from_slab_page(page);
-	else
-		memcg = page_memcg_rcu(page);
-
+	memcg = page_memcg_rcu(page);
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
@@ -2776,17 +2772,46 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	page = virt_to_head_page(p);
 
 	/*
-	 * Slab pages don't have page->mem_cgroup set because corresponding
-	 * kmem caches can be reparented during the lifetime. That's why
-	 * memcg_from_slab_page() should be used instead.
+	 * Slab objects are accounted individually, not per-page.
+	 * Memcg membership data for each individual object is saved in
+	 * the page->obj_cgroups.
 	 */
-	if (PageSlab(page))
-		return memcg_from_slab_page(page);
+	if (page_has_obj_cgroups(page)) {
+		struct obj_cgroup *objcg;
+		unsigned int off;
+
+		off = obj_to_index(page->slab_cache, page, p);
+		objcg = page_obj_cgroups(page)[off];
+		return obj_cgroup_memcg(objcg);
+	}
 
-	/* All other pages use page->mem_cgroup */
 	return page_memcg(page);
 }
 
+__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (unlikely(!current->mm))
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(current->active_memcg))
+		memcg = rcu_dereference(current->active_memcg);
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	if (memcg && memcg != root_mem_cgroup) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && !obj_cgroup_tryget(objcg))
+			objcg = NULL;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -2812,9 +2837,7 @@ static int memcg_alloc_cache_id(void)
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
-	err = memcg_update_all_caches(size);
-	if (!err)
-		err = memcg_update_all_list_lrus(size);
+	err = memcg_update_all_list_lrus(size);
 	if (!err)
 		memcg_nr_cache_ids = size;
 
@@ -2833,7 +2856,6 @@ static void memcg_free_cache_id(int id)
 }
 
 struct memcg_kmem_cache_create_work {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *cachep;
 	struct work_struct work;
 };
@@ -2842,31 +2864,24 @@ static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
 		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct mem_cgroup *memcg = cw->memcg;
 	struct kmem_cache *cachep = cw->cachep;
 
-	memcg_create_kmem_cache(memcg, cachep);
+	memcg_create_kmem_cache(cachep);
 
-	css_put(&memcg->css);
 	kfree(cw);
 }
 
 /*
  * Enqueue the creation of a per-memcg kmem_cache.
  */
-static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
-					       struct kmem_cache *cachep)
+static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
 {
 	struct memcg_kmem_cache_create_work *cw;
 
-	if (!css_tryget_online(&memcg->css))
-		return;
-
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
@@ -2874,102 +2889,26 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 }
 
 /**
- * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
  *
  * Return the kmem_cache we're supposed to use for a slab allocation.
- * We try to use the current memcg's version of the cache.
  *
  * If the cache does not exist yet, if we are the first user of it, we
  * create it asynchronously in a workqueue and let the current allocation
  * go through with the original cache.
- *
- * This function takes a reference to the cache it returns to assure it
- * won't get destroyed while we are working with it. Once the caller is
- * done with it, memcg_kmem_put_cache() must be called to release the
- * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
-	struct memcg_cache_array *arr;
-	int kmemcg_id;
 
-	VM_BUG_ON(!is_root_cache(cachep));
-
-	if (memcg_kmem_bypass())
+	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
+	if (unlikely(!memcg_cachep)) {
+		memcg_schedule_kmem_cache_create(cachep);
 		return cachep;
-
-	rcu_read_lock();
-
-	if (unlikely(current->active_memcg))
-		memcg = current->active_memcg;
-	else
-		memcg = mem_cgroup_from_task(current);
-
-	if (!memcg || memcg == root_mem_cgroup)
-		goto out_unlock;
-
-	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
-	if (kmemcg_id < 0)
-		goto out_unlock;
-
-	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match the data dependency
-	 * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
-	 */
-	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
-
-	/*
-	 * If we are in a safe context (can wait, and not in interrupt
-	 * context), we could be be predictable and return right away.
-	 * This would guarantee that the allocation being performed
-	 * already belongs in the new cache.
-	 *
-	 * However, there are some clashes that can arrive from locking.
-	 * For instance, because we acquire the slab_mutex while doing
-	 * memcg_create_kmem_cache, this means no further allocation
-	 * could happen with the slab_mutex held. So it's better to
-	 * defer everything.
-	 *
-	 * If the memcg is dying or memcg_cache is about to be released,
-	 * don't bother creating new kmem_caches. Because memcg_cachep
-	 * is ZEROed as the fist step of kmem offlining, we don't need
-	 * percpu_ref_tryget_live() here. css_tryget_online() check in
-	 * memcg_schedule_kmem_cache_create() will prevent us from
-	 * creation of a new kmem_cache.
-	 */
-	if (unlikely(!memcg_cachep))
-		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
-		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
-
-		if (!objcg || !obj_cgroup_tryget(objcg)) {
-			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
-			goto out_unlock;
-		}
-
-		*objcgp = objcg;
-		cachep = memcg_cachep;
 	}
-out_unlock:
-	rcu_read_unlock();
-	return cachep;
-}
 
-/**
- * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
- * @cachep: the cache returned by memcg_kmem_get_cache
- */
-void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (!is_root_cache(cachep))
-		percpu_ref_put(&cachep->memcg_params.refcnt);
+	return memcg_cachep;
 }
 
 /**
@@ -3641,7 +3580,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
-	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -3654,22 +3592,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	if (memcg->kmem_state != KMEM_ONLINE)
 		return;
-	/*
-	 * Clear the online state before clearing memcg_caches array
-	 * entries. The slab_mutex in memcg_deactivate_kmem_caches()
-	 * guarantees that no cache will be created for this cgroup
-	 * after we are done (see memcg_create_kmem_cache()).
-	 */
+
 	memcg->kmem_state = KMEM_ALLOCATED;
 
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
-	/*
-	 * Deactivate and reparent kmem_caches and objcgs.
-	 */
-	memcg_deactivate_kmem_caches(memcg, parent);
 	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -3704,10 +3633,8 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
 		memcg_offline_kmem(memcg);
 
-	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		WARN_ON(!list_empty(&memcg->kmem_caches));
+	if (memcg->kmem_state == KMEM_ALLOCATED)
 		static_branch_dec(&memcg_kmem_enabled_key);
-	}
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -5283,9 +5210,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
-#ifdef CONFIG_MEMCG_KMEM
-		INIT_LIST_HEAD(&memcg->kmem_caches);
-#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index f16e896d5a3d..1f6ce2018993 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache, NULL);
+	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
@@ -2244,17 +2244,6 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
-{
-	__kmem_cache_shrink(cachep);
-}
-
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-}
-#endif
-
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	return __kmem_cache_shrink(cachep);
@@ -3862,7 +3851,8 @@ static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 		return ret;
 
 	lockdep_assert_held(&slab_mutex);
-	for_each_memcg_cache(c, cachep) {
+	c = memcg_cache(cachep);
+	if (c) {
 		/* return value determined by the root cache only */
 		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 6585638e5be0..732051d6861d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,66 +32,25 @@ struct kmem_cache {
 
 #else /* !CONFIG_SLOB */
 
-struct memcg_cache_array {
-	struct rcu_head rcu;
-	struct kmem_cache *entries[0];
-};
-
 /*
  * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child caches will have it. For the root cache,
- * this will hold a dynamically allocated array large enough to hold
- * information about the currently limited memcgs in the system. To allow the
- * array to be accessed without taking any locks, on relocation we free the old
- * version only after a grace period.
- *
- * Root and child caches hold different metadata.
+ * Both the root cache and the child cache will have it. Some fields are used
+ * in both cases, other are specific to root caches.
  *
  * @root_cache:	Common to root and child caches.  NULL for root, pointer to
  *		the root cache for children.
  *
  * The following fields are specific to root caches.
  *
- * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
- *		used to index child cachces during allocation and cleared
- *		early during shutdown.
- *
- * @root_caches_node: List node for slab_root_caches list.
- *
- * @children:	List of all child caches.  While the child caches are also
- *		reachable through @memcg_caches, a child cache remains on
- *		this list until it is actually destroyed.
- *
- * The following fields are specific to child caches.
- *
- * @memcg:	Pointer to the memcg this cache belongs to.
- *
- * @children_node: List node for @root_cache->children list.
- *
- * @kmem_caches_node: List node for @memcg->kmem_caches list.
+ * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
+ *		cgroups.
+ * @root_caches_node: list node for slab_root_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
-	union {
-		struct {
-			struct memcg_cache_array __rcu *memcg_caches;
-			struct list_head __root_caches_node;
-			struct list_head children;
-			bool dying;
-		};
-		struct {
-			struct mem_cgroup *memcg;
-			struct list_head children_node;
-			struct list_head kmem_caches_node;
-			struct percpu_ref refcnt;
-
-			void (*work_fn)(struct kmem_cache *);
-			union {
-				struct rcu_head rcu_head;
-				struct work_struct work;
-			};
-		};
-	};
+
+	struct kmem_cache *memcg_cache;
+	struct list_head __root_caches_node;
 };
 #endif /* CONFIG_SLOB */
 
@@ -234,8 +193,6 @@ bool __kmem_cache_empty(struct kmem_cache *);
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
-void __kmemcg_cache_deactivate(struct kmem_cache *s);
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 void kmem_cache_shrink_all(struct kmem_cache *s);
 
@@ -281,14 +238,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 extern struct list_head		slab_root_caches;
 #define root_caches_node	memcg_params.__root_caches_node
 
-/*
- * Iterate over all memcg caches of the given root cache. The caller must hold
- * slab_mutex.
- */
-#define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -319,6 +268,13 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return s->memcg_params.memcg_cache;
+	return NULL;
+}
+
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -331,25 +287,9 @@ static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 		((unsigned long)page->obj_cgroups & ~0x1UL);
 }
 
-/*
- * Expects a pointer to a slab page. Please note, that PageSlab() check
- * isn't sufficient, as it returns true also for tail compound slab pages,
- * which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageSlab() && !PageTail()
- * check.
- *
- * The kmem_cache can be reparented asynchronously. The caller must ensure
- * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
- */
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline bool page_has_obj_cgroups(struct page *page)
 {
-	struct kmem_cache *s;
-
-	s = READ_ONCE(page->slab_cache);
-	if (s && !is_root_cache(s))
-		return READ_ONCE(s->memcg_params.memcg);
-
-	return NULL;
+	return ((unsigned long)page->obj_cgroups & 0x1UL);
 }
 
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
@@ -385,16 +325,25 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 						size_t objects, gfp_t flags)
 {
 	struct kmem_cache *cachep;
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return s;
 
-	cachep = memcg_kmem_get_cache(s, objcgp);
+	cachep = memcg_kmem_get_cache(s);
 	if (is_root_cache(cachep))
 		return s;
 
-	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
-		memcg_kmem_put_cache(cachep);
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return s;
+
+	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
+		obj_cgroup_put(objcg);
 		cachep = NULL;
 	}
 
+	*objcgp = objcg;
 	return cachep;
 }
 
@@ -431,7 +380,6 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 		}
 	}
 	obj_cgroup_put(objcg);
-	memcg_kmem_put_cache(s);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
@@ -455,7 +403,7 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
 
@@ -463,9 +411,6 @@ extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 #define slab_root_caches	slab_caches
 #define root_caches_node	list
 
-#define for_each_memcg_cache(iter, root) \
-	for ((void)(iter), (void)(root); 0; )
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -487,7 +432,17 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	return NULL;
+}
+
+static inline bool page_has_obj_cgroups(struct page *page)
+{
+	return false;
+}
+
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
 }
@@ -524,8 +479,7 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s,
-				    struct mem_cgroup *memcg)
+static inline void memcg_link_cache(struct kmem_cache *s)
 {
 }
 
@@ -547,17 +501,14 @@ static __always_inline int charge_slab_page(struct page *page,
 					    struct kmem_cache *s,
 					    unsigned int objects)
 {
-#ifdef CONFIG_MEMCG_KMEM
 	if (!is_root_cache(s)) {
 		int ret;
 
 		ret = memcg_alloc_page_obj_cgroups(page, gfp, objects);
 		if (ret)
 			return ret;
-
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
 	return 0;
@@ -566,12 +517,9 @@ static __always_inline int charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
-	if (!is_root_cache(s)) {
+	if (!is_root_cache(s))
 		memcg_free_page_obj_cgroups(page);
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -720,9 +668,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
-void *memcg_slab_start(struct seq_file *m, loff_t *pos);
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
-void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 80b4efdb1df3..9ad13c7e28fc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -131,141 +131,36 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
-static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
-	INIT_LIST_HEAD(&s->memcg_params.children);
-	s->memcg_params.dying = false;
+	s->memcg_params.memcg_cache = NULL;
 }
 
-static int init_memcg_params(struct kmem_cache *s,
-			     struct kmem_cache *root_cache)
+static void init_memcg_params(struct kmem_cache *s,
+			      struct kmem_cache *root_cache)
 {
-	struct memcg_cache_array *arr;
-
-	if (root_cache) {
-		int ret = percpu_ref_init(&s->memcg_params.refcnt,
-					  kmemcg_cache_shutdown,
-					  0, GFP_KERNEL);
-		if (ret)
-			return ret;
-
+	if (root_cache)
 		s->memcg_params.root_cache = root_cache;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
-		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
-		return 0;
-	}
-
-	slab_init_memcg_params(s);
-
-	if (!memcg_nr_cache_ids)
-		return 0;
-
-	arr = kvzalloc(sizeof(struct memcg_cache_array) +
-		       memcg_nr_cache_ids * sizeof(void *),
-		       GFP_KERNEL);
-	if (!arr)
-		return -ENOMEM;
-
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
-	return 0;
-}
-
-static void destroy_memcg_params(struct kmem_cache *s)
-{
-	if (is_root_cache(s)) {
-		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-	} else {
-		mem_cgroup_put(s->memcg_params.memcg);
-		WRITE_ONCE(s->memcg_params.memcg, NULL);
-		percpu_ref_exit(&s->memcg_params.refcnt);
-	}
-}
-
-static void free_memcg_params(struct rcu_head *rcu)
-{
-	struct memcg_cache_array *old;
-
-	old = container_of(rcu, struct memcg_cache_array, rcu);
-	kvfree(old);
-}
-
-static int update_memcg_params(struct kmem_cache *s, int new_array_size)
-{
-	struct memcg_cache_array *old, *new;
-
-	new = kvzalloc(sizeof(struct memcg_cache_array) +
-		       new_array_size * sizeof(void *), GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	old = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	if (old)
-		memcpy(new->entries, old->entries,
-		       memcg_nr_cache_ids * sizeof(void *));
-
-	rcu_assign_pointer(s->memcg_params.memcg_caches, new);
-	if (old)
-		call_rcu(&old->rcu, free_memcg_params);
-	return 0;
-}
-
-int memcg_update_all_caches(int num_memcgs)
-{
-	struct kmem_cache *s;
-	int ret = 0;
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		ret = update_memcg_params(s, num_memcgs);
-		/*
-		 * Instead of freeing the memory, we'll just leave the caches
-		 * up to this point in an updated state.
-		 */
-		if (ret)
-			break;
-	}
-	mutex_unlock(&slab_mutex);
-	return ret;
+	else
+		slab_init_memcg_params(s);
 }
 
-void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
+void memcg_link_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_add(&s->root_caches_node, &slab_root_caches);
-	} else {
-		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
-		list_add(&s->memcg_params.children_node,
-			 &s->memcg_params.root_cache->memcg_params.children);
-		list_add(&s->memcg_params.kmem_caches_node,
-			 &s->memcg_params.memcg->kmem_caches);
-	}
 }
 
 static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_del(&s->root_caches_node);
-	} else {
-		list_del(&s->memcg_params.children_node);
-		list_del(&s->memcg_params.kmem_caches_node);
-	}
 }
 #else
-static inline int init_memcg_params(struct kmem_cache *s,
-				    struct kmem_cache *root_cache)
-{
-	return 0;
-}
-
-static inline void destroy_memcg_params(struct kmem_cache *s)
+static inline void init_memcg_params(struct kmem_cache *s,
+				     struct kmem_cache *root_cache)
 {
 }
 
@@ -380,7 +275,7 @@ static struct kmem_cache *create_cache(const char *name,
 		unsigned int object_size, unsigned int align,
 		slab_flags_t flags, unsigned int useroffset,
 		unsigned int usersize, void (*ctor)(void *),
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
@@ -400,24 +295,20 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, root_cache);
-	if (err)
-		goto out_free_cache;
-
+	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, memcg);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
 	return s;
 
 out_free_cache:
-	destroy_memcg_params(s);
 	kmem_cache_free(kmem_cache, s);
 	goto out;
 }
@@ -504,7 +395,7 @@ kmem_cache_create_usercopy(const char *name,
 
 	s = create_cache(cache_name, size,
 			 calculate_alignment(flags, align, size),
-			 flags, useroffset, usersize, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -629,51 +520,27 @@ static int shutdown_cache(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG_KMEM
 /*
- * memcg_create_kmem_cache - Create a cache for a memory cgroup.
- * @memcg: The memory cgroup the new cache is for.
+ * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
  * @root_cache: The parent of the new cache.
  *
  * This function attempts to create a kmem cache that will serve allocation
- * requests going from @memcg to @root_cache. The new cache inherits properties
- * from its parent.
+ * requests going all non-root memory cgroups to @root_cache. The new cache
+ * inherits properties from its parent.
  */
-void memcg_create_kmem_cache(struct mem_cgroup *memcg,
-			     struct kmem_cache *root_cache)
+void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 {
-	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = &memcg->css;
-	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-	int idx;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	/*
-	 * The memory cgroup could have been offlined while the cache
-	 * creation work was pending.
-	 */
-	if (memcg->kmem_state != KMEM_ONLINE)
+	if (root_cache->memcg_params.memcg_cache)
 		goto out_unlock;
 
-	idx = memcg_cache_id(memcg);
-	arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-
-	/*
-	 * Since per-memcg caches are created asynchronously on first
-	 * allocation (see memcg_kmem_get_cache()), several threads can try to
-	 * create the same cache, but only one of them may succeed.
-	 */
-	if (arr->entries[idx])
-		goto out_unlock;
-
-	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
-			       css->serial_nr, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
 	if (!cache_name)
 		goto out_unlock;
 
@@ -681,7 +548,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			 root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
 			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, memcg, root_cache);
+			 root_cache->ctor, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
@@ -698,7 +565,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * initialized.
 	 */
 	smp_wmb();
-	arr->entries[idx] = s;
+	root_cache->memcg_params.memcg_cache = s;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
@@ -707,197 +574,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.work);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	s->memcg_params.work_fn(s);
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static void kmemcg_rcufn(struct rcu_head *head)
-{
-	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.rcu_head);
-
-	/*
-	 * We need to grab blocking locks.  Bounce to ->work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized eariler.
-	 */
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-}
-
-static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
-{
-	WARN_ON(shutdown_cache(s));
-}
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
-{
-	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
-					    memcg_params.refcnt);
-	unsigned long flags;
-
-	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-
-unlock:
-	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
-}
-
-static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	__kmemcg_cache_deactivate_after_rcu(s);
-	percpu_ref_kill(&s->memcg_params.refcnt);
-}
-
-static void kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	if (WARN_ON_ONCE(is_root_cache(s)))
-		return;
-
-	__kmemcg_cache_deactivate(s);
-	s->flags |= SLAB_DEACTIVATED;
-
-	/*
-	 * memcg_kmem_wq_lock is used to synchronize memcg_params.dying
-	 * flag and make sure that no new kmem_cache deactivation tasks
-	 * are queued (see flush_memcg_workqueue() ).
-	 */
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
-	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
-unlock:
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-}
-
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
-{
-	int idx;
-	struct memcg_cache_array *arr;
-	struct kmem_cache *s, *c;
-	unsigned int nr_reparented;
-
-	idx = memcg_cache_id(memcg);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-						lockdep_is_held(&slab_mutex));
-		c = arr->entries[idx];
-		if (!c)
-			continue;
-
-		kmemcg_cache_deactivate(c);
-		arr->entries[idx] = NULL;
-	}
-	nr_reparented = 0;
-	list_for_each_entry(s, &memcg->kmem_caches,
-			    memcg_params.kmem_caches_node) {
-		WRITE_ONCE(s->memcg_params.memcg, parent);
-		css_put(&memcg->css);
-		nr_reparented++;
-	}
-	if (nr_reparented) {
-		list_splice_init(&memcg->kmem_caches,
-				 &parent->kmem_caches);
-		css_get_many(&parent->css, nr_reparented);
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
-	struct memcg_cache_array *arr;
-	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
-	int i;
-
 	BUG_ON(!is_root_cache(s));
 
-	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
-	 */
-	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.children_node, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
-
-	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
-	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
-		shutdown_cache(c);
-
-	list_splice(&busy, &s->memcg_params.children);
+	if (s->memcg_params.memcg_cache)
+		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
 
-	/*
-	 * A cache being destroyed must be empty. In particular, this means
-	 * that all per memcg caches attached to it must be empty too.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		return -EBUSY;
 	return 0;
 }
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	s->memcg_params.dying = true;
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-
-	/*
-	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-	 * sure all registered rcu callbacks have been invoked.
-	 */
-	rcu_barrier();
-
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
@@ -905,18 +593,6 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
 	 */
 	if (likely(memcg_kmem_cache_wq))
 		flush_workqueue(memcg_kmem_cache_wq);
-
-	/*
-	 * If we're racing with children kmem_cache deactivation, it might
-	 * take another rcu grace period to complete their destruction.
-	 * At this moment the corresponding percpu_ref_kill() call should be
-	 * done, but it might take another rcu grace period to complete
-	 * switching to the atomic mode.
-	 * Please, note that we check without grabbing the slab_mutex. It's safe
-	 * because at this moment the children list can't grow.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		rcu_barrier();
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -932,7 +608,6 @@ static inline void flush_memcg_workqueue(struct kmem_cache *s)
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
-	destroy_memcg_params(s);
 	kfree_const(s->name);
 	kmem_cache_free(kmem_cache, s);
 }
@@ -996,7 +671,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 EXPORT_SYMBOL(kmem_cache_shrink);
 
 /**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * kmem_cache_shrink_all - shrink root and memcg caches
  * @s: The cache pointer
  */
 void kmem_cache_shrink_all(struct kmem_cache *s)
@@ -1013,21 +688,11 @@ void kmem_cache_shrink_all(struct kmem_cache *s)
 	kasan_cache_shrink(s);
 	__kmem_cache_shrink(s);
 
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
+	c = memcg_cache(s);
+	if (c) {
 		kasan_cache_shrink(c);
 		__kmem_cache_shrink(c);
 	}
-	mutex_unlock(&slab_mutex);
 	put_online_mems();
 	put_online_cpus();
 }
@@ -1082,7 +747,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1444,7 +1109,8 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	if (!is_root_cache(s))
 		return;
 
-	for_each_memcg_cache(c, s) {
+	c = memcg_cache(s);
+	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
 		get_slabinfo(c, &sinfo);
 
@@ -1574,7 +1240,7 @@ module_init(slab_proc_init);
 
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
 /*
- * Display information about kmem caches that have child memcg caches.
+ * Display information about kmem caches that have memcg cache.
  */
 static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 {
@@ -1586,9 +1252,9 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
 	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		/*
-		 * Skip kmem caches that don't have any memcg children.
+		 * Skip kmem caches that don't have the memcg cache.
 		 */
-		if (list_empty(&s->memcg_params.children))
+		if (!s->memcg_params.memcg_cache)
 			continue;
 
 		memset(&sinfo, 0, sizeof(sinfo));
@@ -1597,23 +1263,13 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
 			   sinfo.active_slabs, sinfo.num_slabs);
 
-		for_each_memcg_cache(c, s) {
-			struct cgroup_subsys_state *css;
-			char *status = "";
-
-			css = &c->memcg_params.memcg->css;
-			if (!(css->flags & CSS_ONLINE))
-				status = ":dead";
-			else if (c->flags & SLAB_DEACTIVATED)
-				status = ":deact";
-
-			memset(&sinfo, 0, sizeof(sinfo));
-			get_slabinfo(c, &sinfo);
-			seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n",
-				   cache_name(c), css->id, status,
-				   sinfo.active_objs, sinfo.num_objs,
-				   sinfo.active_slabs, sinfo.num_slabs);
-		}
+		c = s->memcg_params.memcg_cache;
+		memset(&sinfo, 0, sizeof(sinfo));
+		get_slabinfo(c, &sinfo);
+		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
+			   cache_name(c), root_mem_cgroup->css.id,
+			   sinfo.active_objs, sinfo.num_objs,
+			   sinfo.active_slabs, sinfo.num_slabs);
 	}
 	mutex_unlock(&slab_mutex);
 	return 0;
diff --git a/mm/slub.c b/mm/slub.c
index 6365e89cd503..5b39b5c005bc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4055,36 +4055,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	/*
-	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
-	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
-}
-
-void __kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	/*
-	 * Disable empty slabs caching. Used to avoid pinning offline
-	 * memory cgroups by kmem pages that can be freed.
-	 */
-	slub_set_cpu_partial(s, 0);
-	s->min_partial = 0;
-}
-#endif	/* CONFIG_MEMCG */
-
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4241,7 +4211,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	return s;
 }
 
@@ -4309,7 +4279,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache(c, s) {
+		c = memcg_cache(s);
+		if (c) {
 			c->object_size = s->object_size;
 			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
 		}
@@ -5562,7 +5533,8 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache(c, s)
+		c = memcg_cache(s);
+		if (c)
 			attribute->store(c, buf, len);
 		mutex_unlock(&slab_mutex);
 	}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 22/28] mm: memcg/slab: simplify memcg cache creation
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (20 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 23/28] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Because the number of non-root kmem_caches no longer depends on the
number of memory cgroups and is generally not very big, there is no
longer a need for a dedicated workqueue.

Also, as memcg_create_kmem_cache() no longer takes any arguments
except the root kmem_cache, it's possible to embed the work structure
into the kmem_cache itself and avoid allocating it dynamically.

This also simplifies the synchronization: for each root kmem_cache
there is only one work item, so there can be no concurrent attempts
to create a non-root kmem_cache for a root kmem_cache: the second and
all following attempts to queue the work will simply fail.

On the kmem_cache destruction path there is no longer a need to call
the expensive flush_workqueue() and wait for all pending work items
to finish. Instead, cancel_work_sync() can be used to cancel/wait for
only one work item.
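
Condensed, the resulting pattern looks roughly like this (a simplified
sketch of the code in the diff below; the struct is trimmed to the
relevant fields):

  struct memcg_cache_params {
          struct kmem_cache *root_cache;
          struct kmem_cache *memcg_cache;
          struct work_struct work;        /* embedded, no dynamic allocation */
  };

  static void memcg_kmem_cache_create_func(struct work_struct *work)
  {
          struct kmem_cache *cachep = container_of(work, struct kmem_cache,
                                                   memcg_params.work);
          memcg_create_kmem_cache(cachep);
  }

  /* allocation slow path: schedule the creation once and fall back to
   * the root cache; a second queue_work() on a pending item simply fails */
  queue_work(system_wq, &cachep->memcg_params.work);

  /* destruction path: cancel/wait for at most one work item instead of
   * flushing a whole workqueue */
  cancel_work_sync(&s->memcg_params.work);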

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  1 -
 mm/memcontrol.c            | 48 +-------------------------------------
 mm/slab.h                  |  2 ++
 mm/slab_common.c           | 22 +++++++++--------
 4 files changed, 15 insertions(+), 58 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95d66f46493c..cf2a1703164a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1445,7 +1445,6 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
 extern struct static_key_false memcg_kmem_enabled_key;
-extern struct workqueue_struct *memcg_kmem_cache_wq;
 
 extern int memcg_nr_cache_ids;
 void memcg_get_cache_ids(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3f92b1c71aed..a48d13718ec9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -374,8 +374,6 @@ void memcg_put_cache_ids(void)
  */
 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
-
-struct workqueue_struct *memcg_kmem_cache_wq;
 #endif
 
 static int memcg_shrinker_map_size;
@@ -2855,39 +2853,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-struct memcg_kmem_cache_create_work {
-	struct kmem_cache *cachep;
-	struct work_struct work;
-};
-
-static void memcg_kmem_cache_create_func(struct work_struct *w)
-{
-	struct memcg_kmem_cache_create_work *cw =
-		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct kmem_cache *cachep = cw->cachep;
-
-	memcg_create_kmem_cache(cachep);
-
-	kfree(cw);
-}
-
-/*
- * Enqueue the creation of a per-memcg kmem_cache.
- */
-static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
-{
-	struct memcg_kmem_cache_create_work *cw;
-
-	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
-		return;
-
-	cw->cachep = cachep;
-	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
-
-	queue_work(memcg_kmem_cache_wq, &cw->work);
-}
-
 /**
  * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
@@ -2904,7 +2869,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 
 	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
 	if (unlikely(!memcg_cachep)) {
-		memcg_schedule_kmem_cache_create(cachep);
+		queue_work(system_wq, &cachep->memcg_params.work);
 		return cachep;
 	}
 
@@ -7000,17 +6965,6 @@ static int __init mem_cgroup_init(void)
 {
 	int cpu, node;
 
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * Kmem cache creation is mostly done with the slab_mutex held,
-	 * so use a workqueue with limited concurrency to avoid stalling
-	 * all worker threads in case lots of cgroups are created and
-	 * destroyed simultaneously.
-	 */
-	memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1);
-	BUG_ON(!memcg_kmem_cache_wq);
-#endif
-
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
diff --git a/mm/slab.h b/mm/slab.h
index 732051d6861d..847ceeffff65 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -45,12 +45,14 @@ struct kmem_cache {
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
  * @root_caches_node: list node for slab_root_caches list.
+ * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
 	struct list_head __root_caches_node;
+	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9ad13c7e28fc..a4853b12db1c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,10 +132,18 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 
 LIST_HEAD(slab_root_caches);
 
+static void memcg_kmem_cache_create_func(struct work_struct *work)
+{
+	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
+						 memcg_params.work);
+	memcg_create_kmem_cache(cachep);
+}
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
 	s->memcg_params.memcg_cache = NULL;
+	INIT_WORK(&s->memcg_params.work, memcg_kmem_cache_create_func);
 }
 
 static void init_memcg_params(struct kmem_cache *s,
@@ -584,15 +592,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static void flush_memcg_workqueue(struct kmem_cache *s)
+static void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
-	/*
-	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
-	 * deactivates the memcg kmem_caches through workqueue. Make sure all
-	 * previous workitems on workqueue are processed.
-	 */
-	if (likely(memcg_kmem_cache_wq))
-		flush_workqueue(memcg_kmem_cache_wq);
+	cancel_work_sync(&s->memcg_params.work);
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -600,7 +602,7 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
 	return 0;
 }
 
-static inline void flush_memcg_workqueue(struct kmem_cache *s)
+static inline void cancel_memcg_cache_creation(struct kmem_cache *s)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
@@ -619,7 +621,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;
 
-	flush_memcg_workqueue(s);
+	cancel_memcg_cache_creation(s);
 
 	get_online_cpus();
 	get_online_mems();
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 23/28] mm: memcg/slab: deprecate memcg_kmem_get_cache()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (21 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 22/28] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 24/28] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

The memcg_kmem_get_cache() function became really trivial, so
let's just inline it into the single call point:
memcg_slab_pre_alloc_hook().

This makes the code less bulky and can also help the compiler
generate better code.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  2 --
 mm/memcontrol.c            | 25 +------------------------
 mm/slab.h                  | 11 +++++++++--
 mm/slab_common.c           |  2 +-
 4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cf2a1703164a..68beadc04813 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1430,8 +1430,6 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
-
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
 			unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a48d13718ec9..2f9222ad2691 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -368,7 +368,7 @@ void memcg_put_cache_ids(void)
 
 /*
  * A lot of the calls to the cache allocation functions are expected to be
- * inlined by the compiler. Since the calls to memcg_kmem_get_cache are
+ * inlined by the compiler. Since the calls to memcg_slab_pre_alloc_hook() are
  * conditional to this static branch, we'll have to allow modules that does
  * kmem_cache_alloc and the such to see this symbol as well
  */
@@ -2853,29 +2853,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-/**
- * memcg_kmem_get_cache: select memcg or root cache for allocation
- * @cachep: the original global kmem cache
- *
- * Return the kmem_cache we're supposed to use for a slab allocation.
- *
- * If the cache does not exist yet, if we are the first user of it, we
- * create it asynchronously in a workqueue and let the current allocation
- * go through with the original cache.
- */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
-{
-	struct kmem_cache *memcg_cachep;
-
-	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
-	if (unlikely(!memcg_cachep)) {
-		queue_work(system_wq, &cachep->memcg_params.work);
-		return cachep;
-	}
-
-	return memcg_cachep;
-}
-
 /**
  * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
diff --git a/mm/slab.h b/mm/slab.h
index 847ceeffff65..dcc00fdb9a8a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -332,9 +332,16 @@ static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	if (memcg_kmem_bypass())
 		return s;
 
-	cachep = memcg_kmem_get_cache(s);
-	if (is_root_cache(cachep))
+	cachep = READ_ONCE(s->memcg_params.memcg_cache);
+	if (unlikely(!cachep)) {
+		/*
+		 * If the memcg cache does not exist yet, we schedule its
+		 * asynchronous creation and let the current allocation
+		 * go through with the root cache.
+		 */
+		queue_work(system_wq, &s->memcg_params.work);
 		return s;
+	}
 
 	objcg = get_obj_cgroup_from_current();
 	if (!objcg)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a4853b12db1c..7dc1804623d2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -568,7 +568,7 @@ void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 	}
 
 	/*
-	 * Since readers won't lock (see memcg_kmem_get_cache()), we need a
+	 * Since readers won't lock (see memcg_slab_pre_alloc_hook()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
 	 * initialized.
 	 */
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 24/28] mm: memcg/slab: deprecate slab_root_caches
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (22 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 23/28] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 25/28] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Currently there are two lists of kmem_caches:
1) slab_caches, which contains all kmem_caches,
2) slab_root_caches, which contains only root kmem_caches.

And there is some preprocessor magic to have a single list
if CONFIG_MEMCG_KMEM isn't enabled.

It was required earlier because the number of non-root kmem_caches
was proportional to the number of memory cgroups and could grow very
large. Now that it cannot exceed the number of root kmem_caches,
there is really no reason to maintain two lists.

We never iterate over the slab_root_caches list on any hot paths,
so it's perfectly fine to iterate over slab_caches and filter out
non-root kmem_caches.

This allows removing a lot of config-dependent code and two pointers
from the kmem_cache structure.
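
With a single list, the slow-path users just walk slab_caches and
skip non-root kmem_caches, roughly like this (a condensed sketch of
the pattern used in the diff below):

  struct kmem_cache *s;

  list_for_each_entry(s, &slab_caches, list) {
          if (!is_root_cache(s))
                  continue;
          /* show or otherwise process the root cache */
  }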

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.c        |  1 -
 mm/slab.h        | 17 -----------------
 mm/slab_common.c | 37 ++++++++-----------------------------
 mm/slub.c        |  1 -
 4 files changed, 8 insertions(+), 48 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 1f6ce2018993..b8603329bd9c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,6 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index dcc00fdb9a8a..de4064e7a61b 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -44,14 +44,12 @@ struct kmem_cache {
  *
  * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
  *		cgroups.
- * @root_caches_node: list node for slab_root_caches list.
  * @work: work struct used to create the non-root cache.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
 
 	struct kmem_cache *memcg_cache;
-	struct list_head __root_caches_node;
 	struct work_struct work;
 };
 #endif /* CONFIG_SLOB */
@@ -235,11 +233,6 @@ static inline int cache_vmstat_idx(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-/* List of all root caches. */
-extern struct list_head		slab_root_caches;
-#define root_caches_node	memcg_params.__root_caches_node
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -412,14 +405,8 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
-
-/* If !memcg, all caches are root. */
-#define slab_root_caches	slab_caches
-#define root_caches_node	list
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -488,10 +475,6 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
-{
-}
-
 #endif /* CONFIG_MEMCG_KMEM */
 
 static inline struct kmem_cache *virt_to_cache(const void *obj)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7dc1804623d2..4c54120c4171 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -129,9 +129,6 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-
-LIST_HEAD(slab_root_caches);
-
 static void memcg_kmem_cache_create_func(struct work_struct *work)
 {
 	struct kmem_cache *cachep = container_of(work, struct kmem_cache,
@@ -154,27 +151,11 @@ static void init_memcg_params(struct kmem_cache *s,
 	else
 		slab_init_memcg_params(s);
 }
-
-void memcg_link_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_add(&s->root_caches_node, &slab_root_caches);
-}
-
-static void memcg_unlink_cache(struct kmem_cache *s)
-{
-	if (is_root_cache(s))
-		list_del(&s->root_caches_node);
-}
 #else
 static inline void init_memcg_params(struct kmem_cache *s,
 				     struct kmem_cache *root_cache)
 {
 }
-
-static inline void memcg_unlink_cache(struct kmem_cache *s)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
@@ -251,7 +232,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
 	if (flags & SLAB_NEVER_MERGE)
 		return NULL;
 
-	list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry_reverse(s, &slab_caches, list) {
 		if (slab_unmergeable(s))
 			continue;
 
@@ -310,7 +291,6 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -505,7 +485,6 @@ static int shutdown_cache(struct kmem_cache *s)
 	if (__kmem_cache_shutdown(s) != 0)
 		return -EBUSY;
 
-	memcg_unlink_cache(s);
 	list_del(&s->list);
 
 	if (s->flags & SLAB_TYPESAFE_BY_RCU) {
@@ -749,7 +728,6 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1089,12 +1067,12 @@ static void print_slabinfo_header(struct seq_file *m)
 void *slab_start(struct seq_file *m, loff_t *pos)
 {
 	mutex_lock(&slab_mutex);
-	return seq_list_start(&slab_root_caches, *pos);
+	return seq_list_start(&slab_caches, *pos);
 }
 
 void *slab_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &slab_root_caches, pos);
+	return seq_list_next(p, &slab_caches, pos);
 }
 
 void slab_stop(struct seq_file *m, void *p)
@@ -1147,11 +1125,12 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 
 static int slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
 
-	if (p == slab_root_caches.next)
+	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	cache_show(s, m);
+	if (is_root_cache(s))
+		cache_show(s, m);
 	return 0;
 }
 
@@ -1252,7 +1231,7 @@ static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	mutex_lock(&slab_mutex);
 	seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>");
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
+	list_for_each_entry(s, &slab_caches, list) {
 		/*
 		 * Skip kmem caches that don't have the memcg cache.
 		 */
diff --git a/mm/slub.c b/mm/slub.c
index 5b39b5c005bc..1d644143f93e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4211,7 +4211,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
 	return s;
 }
 
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 25/28] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (23 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 24/28] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool Roman Gushchin
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

memcg_accumulate_slabinfo() is never called with a non-root
kmem_cache as its first argument, so the is_root_cache(s) check
is redundant and can be removed without any functional change.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab_common.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4c54120c4171..df6683cd20cc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1086,9 +1086,6 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	struct kmem_cache *c;
 	struct slabinfo sinfo;
 
-	if (!is_root_cache(s))
-		return;
-
 	c = memcg_cache(s);
 	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (24 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 25/28] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-27 17:34 ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin, Waiman Long, Tobin C . Harding, Tejun Heo

Add a drgn-based tool to display slab information for a given memcg.
It can replace the cgroup v1 memory.kmem.slabinfo interface on
cgroup v2, but in a more flexible way.

Currently it supports only the SLUB configuration, but SLAB support
can be trivially added later.

Output example:
$ sudo ./tools/cgroup/slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 tools/cgroup/slabinfo.py | 158 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100755 tools/cgroup/slabinfo.py

diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
new file mode 100755
index 000000000000..cdb37665993b
--- /dev/null
+++ b/tools/cgroup/slabinfo.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2019 Roman Gushchin <guro@fb.com>
+# Copyright (C) 2019 Facebook
+
+from os import stat
+import argparse
+import sys
+
+from drgn.helpers.linux import list_for_each_entry, list_empty
+from drgn import container_of
+
+
+DESC = """
+This is a drgn script to provide slab statistics for memory cgroups.
+It supports cgroup v2 and v1 and can emulate memory.kmem.slabinfo
+interface of cgroup v1.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+
+MEMCGS = {}
+
+OO_SHIFT = 16
+OO_MASK = ((1 << OO_SHIFT) - 1)
+
+
+def err(s):
+    print('slabinfo.py: error: %s' % s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+
+def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
+    if not list_empty(css.children.address_of_()):
+        for css in list_for_each_entry('struct cgroup_subsys_state',
+                                       css.children.address_of_(),
+                                       'sibling'):
+            name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
+            memcg = container_of(css, 'struct mem_cgroup', 'css')
+            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
+            find_memcg_ids(css, name)
+
+
+def is_root_cache(s):
+    return False if s.memcg_params.root_cache else True
+
+
+def cache_name(s):
+    if is_root_cache(s):
+        return s.name.string_().decode('utf-8')
+    else:
+        return s.memcg_params.root_cache.name.string_().decode('utf-8')
+
+
+# SLUB
+
+def oo_order(s):
+    return s.oo.x >> OO_SHIFT
+
+
+def oo_objects(s):
+    return s.oo.x & OO_MASK
+
+
+def count_partial(n, fn):
+    nr_pages = 0
+    for page in list_for_each_entry('struct page', n.partial.address_of_(),
+                                    'lru'):
+         nr_pages += fn(page)
+    return nr_pages
+
+
+def count_free(page):
+    return page.objects - page.inuse
+
+
+def slub_get_slabinfo(s, cfg):
+    nr_slabs = 0
+    nr_objs = 0
+    nr_free = 0
+
+    for node in range(cfg['nr_nodes']):
+        n = s.node[node]
+        nr_slabs += n.nr_slabs.counter.value_()
+        nr_objs += n.total_objects.counter.value_()
+        nr_free += count_partial(n, count_free)
+
+    return {'active_objs': nr_objs - nr_free,
+            'num_objs': nr_objs,
+            'active_slabs': nr_slabs,
+            'num_slabs': nr_slabs,
+            'objects_per_slab': oo_objects(s),
+            'cache_order': oo_order(s),
+            'limit': 0,
+            'batchcount': 0,
+            'shared': 0,
+            'shared_avail': 0}
+
+
+def cache_show(s, cfg):
+    if cfg['allocator'] == 'SLUB':
+        sinfo = slub_get_slabinfo(s, cfg)
+    else:
+        err('SLAB isn\'t supported yet')
+
+    print('%-17s %6lu %6lu %6u %4u %4d'
+          ' : tunables %4u %4u %4u'
+          ' : slabdata %6lu %6lu %6lu' % (
+              cache_name(s), sinfo['active_objs'], sinfo['num_objs'],
+              s.size, sinfo['objects_per_slab'], 1 << sinfo['cache_order'],
+              sinfo['limit'], sinfo['batchcount'], sinfo['shared'],
+              sinfo['active_slabs'], sinfo['num_slabs'],
+              sinfo['shared_avail']))
+
+
+def detect_kernel_config():
+    cfg = {}
+
+    cfg['nr_nodes'] = prog['nr_online_nodes'].value_()
+
+    if prog.type('struct kmem_cache').members[1][1] == 'flags':
+        cfg['allocator'] = 'SLUB'
+    elif prog.type('struct kmem_cache').members[1][1] == 'batchcount':
+        cfg['allocator'] = 'SLAB'
+    else:
+        err('Can\'t determine the slab allocator')
+
+    return cfg
+
+
+def main():
+    parser = argparse.ArgumentParser(description=DESC,
+                                     formatter_class=
+                                     argparse.RawTextHelpFormatter)
+    parser.add_argument('cgroup', metavar='CGROUP',
+                        help='Target memory cgroup')
+    args = parser.parse_args()
+
+    try:
+        cgroup_id = stat(args.cgroup).st_ino
+        find_memcg_ids()
+        memcg = MEMCGS[cgroup_id]
+    except KeyError:
+        err('Can\'t find the memory cgroup')
+
+    cfg = detect_kernel_config()
+
+    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
+          ' : tunables <limit> <batchcount> <sharedfactor>'
+          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
+
+    for s in list_for_each_entry('struct kmem_cache',
+                                 memcg.kmem_caches.address_of_(),
+                                 'memcg_params.kmem_caches_node'):
+        cache_show(s, cfg)
+
+
+main()
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (25 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-30  2:17   ` Bharata B Rao
  2020-01-27 17:34 ` [PATCH v2 28/28] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
  2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
  28 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Make slabinfo.py compatible with the new slab controller.

Because there are no more per-memcg kmem_caches and there is no list
of all slab pages in the system, the script has to walk over all
pages and pick out the slab pages that belong to non-root
kmem_caches.

Then it counts the objects belonging to the given cgroup. This might
sound like a very slow operation, but in practice it isn't: it takes
about 30 seconds to walk over 8 GB of slabs out of 64 GB and pick
out all objects belonging to the cgroup of interest.

Also, it provides an accurate number of active objects, which isn't
the case with the old slab controller.

The script is backward compatible and works for both kernel versions.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/cgroup/slabinfo.py | 74 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
index cdb37665993b..b779a4863beb 100755
--- a/tools/cgroup/slabinfo.py
+++ b/tools/cgroup/slabinfo.py
@@ -8,7 +8,10 @@ import argparse
 import sys
 
 from drgn.helpers.linux import list_for_each_entry, list_empty
-from drgn import container_of
+from drgn.helpers.linux import for_each_page
+from drgn.helpers.linux.cpumask import for_each_online_cpu
+from drgn.helpers.linux.percpu import per_cpu_ptr
+from drgn import container_of, FaultError, Object
 
 
 DESC = """
@@ -97,12 +100,16 @@ def slub_get_slabinfo(s, cfg):
             'shared_avail': 0}
 
 
-def cache_show(s, cfg):
+def cache_show(s, cfg, objs):
     if cfg['allocator'] == 'SLUB':
         sinfo = slub_get_slabinfo(s, cfg)
     else:
         err('SLAB isn\'t supported yet')
 
+    if cfg['shared_slab_pages']:
+        sinfo['active_objs'] = objs
+        sinfo['num_objs'] = objs
+
     print('%-17s %6lu %6lu %6u %4u %4d'
           ' : tunables %4u %4u %4u'
           ' : slabdata %6lu %6lu %6lu' % (
@@ -125,9 +132,26 @@ def detect_kernel_config():
     else:
         err('Can\'t determine the slab allocator')
 
+    if prog.type('struct memcg_cache_params').members[1][1] == 'memcg_cache':
+        cfg['shared_slab_pages'] = True
+    else:
+        cfg['shared_slab_pages'] = False
+
     return cfg
 
 
+def for_each_slab_page(prog):
+    PGSlab = 1 << prog.constant('PG_slab')
+    PGHead = 1 << prog.constant('PG_head')
+
+    for page in for_each_page(prog):
+        try:
+            if page.flags.value_() & PGSlab:
+                yield page
+        except FaultError:
+            pass
+
+
 def main():
     parser = argparse.ArgumentParser(description=DESC,
                                      formatter_class=
@@ -149,10 +173,48 @@ def main():
           ' : tunables <limit> <batchcount> <sharedfactor>'
           ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
 
-    for s in list_for_each_entry('struct kmem_cache',
-                                 memcg.kmem_caches.address_of_(),
-                                 'memcg_params.kmem_caches_node'):
-        cache_show(s, cfg)
+    if cfg['shared_slab_pages']:
+        obj_cgroups = set()
+        stats = {}
+        caches = {}
+
+        # find memcg pointers belonging to the specified cgroup
+        for ptr in list_for_each_entry('struct obj_cgroup',
+                                       memcg.objcg_list.address_of_(),
+                                       'list'):
+            obj_cgroups.add(ptr.value_())
+
+        # look over all slab pages, belonging to non-root memcgs
+        # and look for objects belonging to the given memory cgroup
+        for page in for_each_slab_page(prog):
+            objcg_vec_raw = page.obj_cgroups.value_()
+            if objcg_vec_raw == 0:
+                continue
+            cache = page.slab_cache
+            if not cache or is_root_cache(cache):
+                continue
+            addr = cache.value_()
+            caches[addr] = cache
+            # clear the lowest bit to get the true obj_cgroups
+            objcg_vec = Object(prog, page.obj_cgroups.type_,
+                               value=objcg_vec_raw & ~1)
+
+            if addr not in stats:
+                stats[addr] = 0
+
+            for i in range(oo_objects(cache)):
+                if objcg_vec[i].value_() in obj_cgroups:
+                    stats[addr] += 1
+
+        for addr in caches:
+            if stats[addr] > 0:
+                cache_show(caches[addr], cfg, stats[addr])
+
+    else:
+        for s in list_for_each_entry('struct kmem_cache',
+                                     memcg.kmem_caches.address_of_(),
+                                     'memcg_params.kmem_caches_node'):
+            cache_show(s, cfg, None)
 
 
 main()
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 28/28] kselftests: cgroup: add kernel memory accounting tests
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (26 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
  28 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao,
	Roman Gushchin

Add some tests to cover the kernel memory accounting functionality.
They cover some issues (and changes) we have had recently.

1) A test which allocates a lot of negative dentries, checks memcg
slab statistics, creates memory pressure by setting memory.high
to some low value and checks that some number of slabs was reclaimed.

2) A test which covers side effects of memcg destruction: it creates
and destroys a large number of sub-cgroups, each containing a
multi-threaded workload which allocates and releases some kernel
memory. Then it checks that the charges and memory.stat values add up
on the parent level.

3) A test which reads /proc/kpagecgroup and implicitly checks that it
doesn't crash the system.

4) A test which spawns a large number of threads and checks that
the kernel stacks accounting works as expected.

5) A test which checks that live charged slab objects do not prevent
the memory cgroup from being released after it has been deleted by
a user.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 380 +++++++++++++++++++++
 3 files changed, 383 insertions(+)
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 7f9835624793..fa6660aba062 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -1,3 +1,4 @@
 test_memcontrol
 test_core
 test_freezer
+test_kmem
\ No newline at end of file
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 66aafe1f5746..d0b3bca5dabb 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -6,11 +6,13 @@ all:
 TEST_FILES     := with_stress.sh
 TEST_PROGS     := test_stress.sh
 TEST_GEN_PROGS = test_memcontrol
+TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
 TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c
+$(OUTPUT)/test_kmem: cgroup_util.c
 $(OUTPUT)/test_core: cgroup_util.c
 $(OUTPUT)/test_freezer: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c
new file mode 100644
index 000000000000..b4bb50a4c862
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kmem.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+
+#include <linux/limits.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <sys/sysinfo.h>
+#include <pthread.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+
+static int alloc_dcache(const char *cgroup, void *arg)
+{
+	unsigned long i;
+	struct stat st;
+	char buf[128];
+
+	for (i = 0; i < (unsigned long)arg; i++) {
+		snprintf(buf, sizeof(buf),
+			"/something-non-existent-with-a-long-name-%64lu-%d",
+			 i, getpid());
+		stat(buf, &st);
+	}
+
+	return 0;
+}
+
+/*
+ * This test allocates 100000 negative dentries with long names.
+ * Then it checks that "slab" in memory.stat is larger than 1M.
+ * Then it sets memory.high to 1M and checks that at least 1/2
+ * of slab memory has been reclaimed.
+ */
+static int test_kmem_basic(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+	long slab0, slab1, current;
+
+	cg = cg_name(root, "kmem_basic_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, alloc_dcache, (void *)100000))
+		goto cleanup;
+
+	slab0 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab0 < (1 << 20))
+		goto cleanup;
+
+	cg_write(cg, "memory.high", "1M");
+	slab1 = cg_read_key_long(cg, "memory.stat", "slab ");
+	if (slab1 <= 0)
+		goto cleanup;
+
+	current = cg_read_long(cg, "memory.current");
+	if (current <= 0)
+		goto cleanup;
+
+	if (slab1 < slab0 / 2 && current < slab0 / 2)
+		ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+static void *alloc_kmem_fn(void *arg)
+{
+	alloc_dcache(NULL, (void *)10);
+	return NULL;
+}
+
+static int alloc_kmem_smp(const char *cgroup, void *arg)
+{
+	int nr_threads = 2 * get_nprocs();
+	pthread_t *tinfo;
+	unsigned long i;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &alloc_kmem_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return -1;
+		}
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		ret = pthread_join(tinfo[i], NULL);
+		if (ret)
+			break;
+	}
+
+	free(tinfo);
+	return ret;
+}
+
+static int cg_run_in_subcgroups(const char *parent,
+				int (*fn)(const char *cgroup, void *arg),
+				void *arg, int times)
+{
+	char *child;
+	int i;
+
+	for (i = 0; i < times; i++) {
+		child = cg_name_indexed(parent, "child", i);
+		if (!child)
+			return -1;
+
+		if (cg_create(child)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		if (cg_run(child, fn, arg)) {
+			cg_destroy(child);
+			free(child);
+			return -1;
+		}
+
+		cg_destroy(child);
+		free(child);
+	}
+
+	return 0;
+}
+
+/*
+ * The test creates and destroys a large number of cgroups. In each cgroup it
+ * allocates some slab memory (mostly negative dentries) using 2 * NR_CPUS
+ * threads. Then it checks the sanity of numbers on the parent level:
+ * the total size of the cgroups should be roughly equal to
+ * anon + file + slab + kernel_stack.
+ */
+static int test_kmem_memcg_deletion(const char *root)
+{
+	long current, slab, anon, file, kernel_stack, sum;
+	int ret = KSFT_FAIL;
+	char *parent;
+
+	parent = cg_name(root, "kmem_memcg_deletion_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_kmem_smp, NULL, 1000))
+		goto cleanup;
+
+	current = cg_read_long(parent, "memory.current");
+	slab = cg_read_key_long(parent, "memory.stat", "slab ");
+	anon = cg_read_key_long(parent, "memory.stat", "anon ");
+	file = cg_read_key_long(parent, "memory.stat", "file ");
+	kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack ");
+	if (current < 0 || slab < 0 || anon < 0 || file < 0 ||
+	    kernel_stack < 0)
+		goto cleanup;
+
+	sum = slab + anon + file + kernel_stack;
+	if (abs(sum - current) < 4096 * 32 * 2 * get_nprocs()) {
+		ret = KSFT_PASS;
+	} else {
+		printf("memory.current = %ld\n", current);
+		printf("slab + anon + file + kernel_stack = %ld\n", sum);
+		printf("slab = %ld\n", slab);
+		printf("anon = %ld\n", anon);
+		printf("file = %ld\n", file);
+		printf("kernel_stack = %ld\n", kernel_stack);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+/*
+ * The test reads the entire /proc/kpagecgroup. If the operation went
+ * successfully (and the kernel didn't panic), the test is treated as passed.
+ */
+static int test_kmem_proc_kpagecgroup(const char *root)
+{
+	unsigned long buf[128];
+	int ret = KSFT_FAIL;
+	ssize_t len;
+	int fd;
+
+	fd = open("/proc/kpagecgroup", O_RDONLY);
+	if (fd < 0)
+		return ret;
+
+	do {
+		len = read(fd, buf, sizeof(buf));
+	} while (len > 0);
+
+	if (len == 0)
+		ret = KSFT_PASS;
+
+	close(fd);
+	return ret;
+}
+
+static void *pthread_wait_fn(void *arg)
+{
+	sleep(100);
+	return NULL;
+}
+
+static int spawn_1000_threads(const char *cgroup, void *arg)
+{
+	int nr_threads = 1000;
+	pthread_t *tinfo;
+	unsigned long i;
+	long stack;
+	int ret = -1;
+
+	tinfo = calloc(nr_threads, sizeof(pthread_t));
+	if (tinfo == NULL)
+		return -1;
+
+	for (i = 0; i < nr_threads; i++) {
+		if (pthread_create(&tinfo[i], NULL, &pthread_wait_fn,
+				   (void *)i)) {
+			free(tinfo);
+			return(-1);
+		}
+	}
+
+	stack = cg_read_key_long(cgroup, "memory.stat", "kernel_stack ");
+	if (stack >= 4096 * 1000)
+		ret = 0;
+
+	free(tinfo);
+	return ret;
+}
+
+/*
+ * The test spawns a process, which spawns 1000 threads. Then it checks
+ * that memory.stat's kernel_stack is at least 1000 pages large.
+ */
+static int test_kmem_kernel_stacks(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cg = NULL;
+
+	cg = cg_name(root, "kmem_kernel_stacks_test");
+	if (!cg)
+		goto cleanup;
+
+	if (cg_create(cg))
+		goto cleanup;
+
+	if (cg_run(cg, spawn_1000_threads, NULL))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+cleanup:
+	cg_destroy(cg);
+	free(cg);
+
+	return ret;
+}
+
+/*
+ * This test sequentionally creates 30 child cgroups, allocates some
+ * kernel memory in each of them, and deletes them. Then it checks
+ * that the number of dying cgroups on the parent level is 0.
+ */
+static int test_kmem_dead_cgroups(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent;
+	long dead;
+	int i;
+
+	parent = cg_name(root, "kmem_dead_cgroups_test");
+	if (!parent)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_write(parent, "cgroup.subtree_control", "+memory"))
+		goto cleanup;
+
+	if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30))
+		goto cleanup;
+
+	for (i = 0; i < 5; i++) {
+		dead = cg_read_key_long(parent, "cgroup.stat",
+					"nr_dying_descendants ");
+		if (dead == 0) {
+			ret = KSFT_PASS;
+			break;
+		}
+		/*
+		 * Reclaiming cgroups might take some time,
+		 * let's wait a bit and repeat.
+		 */
+		sleep(1);
+	}
+
+cleanup:
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct kmem_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_kmem_basic),
+	T(test_kmem_memcg_deletion),
+	T(test_kmem_proc_kpagecgroup),
+	T(test_kmem_kernel_stacks),
+	T(test_kmem_dead_cgroups),
+};
+#undef T
+
+int main(int argc, char **argv)
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+	/*
+	 * Check that memory controller is available:
+	 * memory is listed in cgroup.controllers
+	 */
+	if (cg_read_strstr(root, "cgroup.controllers", "memory"))
+		ksft_exit_skip("memory controller isn't available\n");
+
+	if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
+		if (cg_write(root, "cgroup.subtree_control", "+memory"))
+			ksft_exit_skip("Failed to set memory controller\n");
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
                   ` (27 preceding siblings ...)
  2020-01-27 17:34 ` [PATCH v2 28/28] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
@ 2020-01-30  2:06 ` Bharata B Rao
  2020-01-30  2:41   ` Roman Gushchin
  28 siblings, 1 reply; 84+ messages in thread
From: Bharata B Rao @ 2020-01-30  2:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, kernel-team,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> The existing cgroup slab memory controller is based on the idea of
> replicating slab allocator internals for each memory cgroup.
> This approach promises a low memory overhead (one pointer per page),
> and isn't adding too much code on hot allocation and release paths.
> But is has a very serious flaw: it leads to a low slab utilization.
> 
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
> 
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.
> 
> Arguably it wasn't an issue back to days when the kmem controller was
> introduced and was an opt-in feature, which had to be turned on
> individually for each memory cgroup. But now it's turned on by default
> on both cgroup v1 and v2. And modern systemd-based systems tend to
> create a large number of cgroups.
> 
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and
> changing memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are
> destroyed with root counterparts. It allows to remove a good amount of code
> and make things generally simpler.
> 
> The patchset* has been tested on a number of different workloads in our
> production. In all cases it saved significant amount of memory, measured
> from high hundreds of MBs to single GBs per host. On average, the size
> of slab memory has been reduced by 35-45%.

Here are some numbers from multiple runs of sysbench and kernel compilation
with this patchset on a 10 core POWER8 host:

==========================================================================
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
of a mem cgroup (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	15859712	4456448		72
memory.usage_in_bytes		337510400	335806464	.5
Slab: (kB)			814336		607296		25

memory.kmem.usage_in_bytes	16187392	4653056		71
memory.usage_in_bytes		318832640	300154880	5
Slab: (kB)			789888		559744		29
--------------------------------------------------------------------------


Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for kernel compilation (make -s -j64) Compilation was
done from bash that is in a memory cgroup. (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	338493440	231931904	31
memory.usage_in_bytes		7368015872	6275923968	15
Slab: (kB)			1139072		785408		31

memory.kmem.usage_in_bytes	341835776	236453888	30
memory.usage_in_bytes		6540427264	6072893440	7
Slab: (kB)			1074304		761280		29

memory.kmem.usage_in_bytes	340525056	233570304	31
memory.usage_in_bytes		6406209536	6177357824	3
Slab: (kB)			1244288		739712		40
--------------------------------------------------------------------------

Slab consumption right after boot
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
Slab: (kB)			821888		583424		29
==========================================================================

Summary:

With sysbench and kernel compilation, memory.kmem.usage_in_bytes consistently
shows around 70% and 30% reduction, respectively.

Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
kernel compilation.

Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
same is seen right after boot too.
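
(In case it's useful to anyone: the peak-usage numbers can be collected
with a trivial poller along these lines -- just a sketch of the idea,
not the exact script used here, and the cgroup path below is only an
example.)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* example path -- point it at the cgroup under test */
        const char *path =
                "/sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes";
        unsigned long long val, peak = 0;

        for (;;) {
                FILE *f = fopen(path, "r");

                if (!f)
                        break;
                if (fscanf(f, "%llu", &val) == 1 && val > peak)
                        peak = val;
                fclose(f);
                printf("peak kmem usage: %llu bytes\n", peak);
                sleep(5);       /* same 5s sampling period as above */
        }
        return 0;
}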

Regards,
Bharata.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-27 17:34 ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
@ 2020-01-30  2:17   ` Bharata B Rao
  2020-01-30  2:44     ` Roman Gushchin
  2020-01-31 22:24     ` Roman Gushchin
  0 siblings, 2 replies; 84+ messages in thread
From: Bharata B Rao @ 2020-01-30  2:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, kernel-team,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> Make slabinfo.py compatible with the new slab controller.
 
Tried using slabinfo.py, but ran into some errors. (I am using your
new_slab.2 branch)

 ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
    runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./tools/cgroup/slabinfo.py", line 220, in <module>
    main()
  File "./tools/cgroup/slabinfo.py", line 165, in main
    find_memcg_ids()
  File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
    MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
AttributeError: '_drgn.Object' object has no attribute 'ino'

I did make this change...

# git diff
diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
index b779a4863beb..571fd95224d6 100755
--- a/tools/cgroup/slabinfo.py
+++ b/tools/cgroup/slabinfo.py
@@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
                                        'sibling'):
             name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
             memcg = container_of(css, 'struct mem_cgroup', 'css')
-            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
+            MEMCGS[css.cgroup.kn.id.value_()] = memcg
             find_memcg_ids(css, name)


but now get empty output.

# ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>

Guess this script is not yet ready for the upstream kernel?

Regards,
Bharata.



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
@ 2020-01-30  2:41   ` Roman Gushchin
  2020-08-12 23:16     ` Pavel Tatashin
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-30  2:41 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > The existing cgroup slab memory controller is based on the idea of
> > replicating slab allocator internals for each memory cgroup.
> > This approach promises a low memory overhead (one pointer per page),
> > and isn't adding too much code on hot allocation and release paths.
> > But is has a very serious flaw: it leads to a low slab utilization.
> > 
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> > 
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> > 
> > Arguably it wasn't an issue back to days when the kmem controller was
> > introduced and was an opt-in feature, which had to be turned on
> > individually for each memory cgroup. But now it's turned on by default
> > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > create a large number of cgroups.
> > 
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> > 
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> > 
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and
> > changing memcg pointer to the parent.
> > 
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are
> > destroyed with root counterparts. It allows to remove a good amount of code
> > and make things generally simpler.
> > 
> > The patchset* has been tested on a number of different workloads in our
> > production. In all cases it saved significant amount of memory, measured
> > from high hundreds of MBs to single GBs per host. On average, the size
> > of slab memory has been reduced by 35-45%.
> 
> Here are some numbers from multiple runs of sysbench and kernel compilation
> with this patchset on a 10 core POWER8 host:
> 
> ==========================================================================
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> of a mem cgroup (Sampling every 5s)
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes	15859712	4456448		72
> memory.usage_in_bytes		337510400	335806464	.5
> Slab: (kB)			814336		607296		25
> 
> memory.kmem.usage_in_bytes	16187392	4653056		71
> memory.usage_in_bytes		318832640	300154880	5
> Slab: (kB)			789888		559744		29
> --------------------------------------------------------------------------
> 
> 
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> done from bash that is in a memory cgroup. (Sampling every 5s)
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes	338493440	231931904	31
> memory.usage_in_bytes		7368015872	6275923968	15
> Slab: (kB)			1139072		785408		31
> 
> memory.kmem.usage_in_bytes	341835776	236453888	30
> memory.usage_in_bytes		6540427264	6072893440	7
> Slab: (kB)			1074304		761280		29
> 
> memory.kmem.usage_in_bytes	340525056	233570304	31
> memory.usage_in_bytes		6406209536	6177357824	3
> Slab: (kB)			1244288		739712		40
> --------------------------------------------------------------------------
> 
> Slab consumption right after boot
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> Slab: (kB)			821888		583424		29
> ==========================================================================
> 
> Summary:
> 
> With sysbench and kernel compilation, memory.kmem.usage_in_bytes consistently
> shows around 70% and 30% reduction, respectively.
> 
> Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> kernel compilation.
> 
> Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> same is seen right after boot too.

That's just perfect!

memory.usage_in_bytes was most likely the same because the freed space
was taken by pagecache.

Thank you very much for testing!

Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-30  2:17   ` Bharata B Rao
@ 2020-01-30  2:44     ` Roman Gushchin
  2020-01-31 22:24     ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-01-30  2:44 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > Make slabinfo.py compatible with the new slab controller.
>  
> Tried using slabinfo.py, but ran into some errors. (I am using your
> new_slab.2 branch)
> 
>  ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo.py", line 220, in <module>
>     main()
>   File "./tools/cgroup/slabinfo.py", line 165, in main
>     find_memcg_ids()
>   File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
>     MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> AttributeError: '_drgn.Object' object has no attribute 'ino'
> 
> I did make this change...
> 
> # git diff
> diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
> index b779a4863beb..571fd95224d6 100755
> --- a/tools/cgroup/slabinfo.py
> +++ b/tools/cgroup/slabinfo.py
> @@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
>                                         'sibling'):
>              name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
>              memcg = container_of(css, 'struct mem_cgroup', 'css')
> -            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> +            MEMCGS[css.cgroup.kn.id.value_()] = memcg
>              find_memcg_ids(css, name)
> 
> 
> but now get empty output.
> 
> # ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> 
> Guess this script is not yet ready for the upstream kernel?

Yes, looks like I've used a slightly outdated kernel version to test it.
I'll fix it in the next version.

Thank you for reporting it!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-30  2:17   ` Bharata B Rao
  2020-01-30  2:44     ` Roman Gushchin
@ 2020-01-31 22:24     ` Roman Gushchin
  2020-02-12  5:21       ` Bharata B Rao
  1 sibling, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-01-31 22:24 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > Make slabinfo.py compatible with the new slab controller.
>  
> Tried using slabinfo.py, but ran into some errors. (I am using your
> new_slab.2 branch)
> 
>  ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo.py", line 220, in <module>
>     main()
>   File "./tools/cgroup/slabinfo.py", line 165, in main
>     find_memcg_ids()
>   File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
>     MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> AttributeError: '_drgn.Object' object has no attribute 'ino'
> 
> I did make this change...
> 
> # git diff
> diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
> index b779a4863beb..571fd95224d6 100755
> --- a/tools/cgroup/slabinfo.py
> +++ b/tools/cgroup/slabinfo.py
> @@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
>                                         'sibling'):
>              name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
>              memcg = container_of(css, 'struct mem_cgroup', 'css')
> -            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> +            MEMCGS[css.cgroup.kn.id.value_()] = memcg
>              find_memcg_ids(css, name)
> 
> 
> but now get empty output.

Btw, I've checked that a change like the one you've made above fixes the problem.
The script works for me both on current upstream and new_slab.2 branch.

Are you sure that in your case there is some kernel memory charged to that
cgroup? Please note that in the current implementation kmem_caches are created
on demand, so the accounting is effectively enabled with some delay.
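
If you want to quickly force some accounted allocations from inside the
cgroup, something in the spirit of the selftest's alloc_dcache() helper
works -- a rough sketch (the /tmp prefix and the count are arbitrary):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        char path[64];
        struct stat st;
        int i;

        /*
         * stat() unique non-existent paths: each lookup leaves a
         * negative dentry behind, which is an accounted slab object.
         */
        for (i = 0; i < 100000; i++) {
                snprintf(path, sizeof(path), "/tmp/kmem-test-%d", i);
                stat(path, &st);
        }
        return 0;
}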

Thank you!

Below is an updated version of the patch to use:
--------------------------------------------------------------------------------

From 69b8e1bf451043c41e43e769b9ae15b36092ddf9 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 15 Oct 2019 17:06:04 -0700
Subject: [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool

Add a drgn-based tool to display slab information for a given memcg.
Can replace cgroup v1 memory.kmem.slabinfo interface on cgroup v2,
but in a more flexible way.

Currently supports only SLUB configuration, but SLAB can be trivially
added later.

Output example:
$ sudo ./tools/cgroup/slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 tools/cgroup/slabinfo.py | 158 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100755 tools/cgroup/slabinfo.py

diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
new file mode 100755
index 000000000000..0dc3a1fc260c
--- /dev/null
+++ b/tools/cgroup/slabinfo.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2019 Roman Gushchin <guro@fb.com>
+# Copyright (C) 2019 Facebook
+
+from os import stat
+import argparse
+import sys
+
+from drgn.helpers.linux import list_for_each_entry, list_empty
+from drgn import container_of
+
+
+DESC = """
+This is a drgn script to provide slab statistics for memory cgroups.
+It supports cgroup v2 and v1 and can emulate memory.kmem.slabinfo
+interface of cgroup v1.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+
+MEMCGS = {}
+
+OO_SHIFT = 16
+OO_MASK = ((1 << OO_SHIFT) - 1)
+
+
+def err(s):
+    print('slabinfo.py: error: %s' % s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+
+def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
+    if not list_empty(css.children.address_of_()):
+        for css in list_for_each_entry('struct cgroup_subsys_state',
+                                       css.children.address_of_(),
+                                       'sibling'):
+            name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
+            memcg = container_of(css, 'struct mem_cgroup', 'css')
+            MEMCGS[css.cgroup.kn.id.value_()] = memcg
+            find_memcg_ids(css, name)
+
+
+def is_root_cache(s):
+    return False if s.memcg_params.root_cache else True
+
+
+def cache_name(s):
+    if is_root_cache(s):
+        return s.name.string_().decode('utf-8')
+    else:
+        return s.memcg_params.root_cache.name.string_().decode('utf-8')
+
+
+# SLUB
+
+def oo_order(s):
+    return s.oo.x >> OO_SHIFT
+
+
+def oo_objects(s):
+    return s.oo.x & OO_MASK
+
+
+def count_partial(n, fn):
+    nr_pages = 0
+    for page in list_for_each_entry('struct page', n.partial.address_of_(),
+                                    'lru'):
+         nr_pages += fn(page)
+    return nr_pages
+
+
+def count_free(page):
+    return page.objects - page.inuse
+
+
+def slub_get_slabinfo(s, cfg):
+    nr_slabs = 0
+    nr_objs = 0
+    nr_free = 0
+
+    for node in range(cfg['nr_nodes']):
+        n = s.node[node]
+        nr_slabs += n.nr_slabs.counter.value_()
+        nr_objs += n.total_objects.counter.value_()
+        nr_free += count_partial(n, count_free)
+
+    return {'active_objs': nr_objs - nr_free,
+            'num_objs': nr_objs,
+            'active_slabs': nr_slabs,
+            'num_slabs': nr_slabs,
+            'objects_per_slab': oo_objects(s),
+            'cache_order': oo_order(s),
+            'limit': 0,
+            'batchcount': 0,
+            'shared': 0,
+            'shared_avail': 0}
+
+
+def cache_show(s, cfg):
+    if cfg['allocator'] == 'SLUB':
+        sinfo = slub_get_slabinfo(s, cfg)
+    else:
+        err('SLAB isn\'t supported yet')
+
+    print('%-17s %6lu %6lu %6u %4u %4d'
+          ' : tunables %4u %4u %4u'
+          ' : slabdata %6lu %6lu %6lu' % (
+              cache_name(s), sinfo['active_objs'], sinfo['num_objs'],
+              s.size, sinfo['objects_per_slab'], 1 << sinfo['cache_order'],
+              sinfo['limit'], sinfo['batchcount'], sinfo['shared'],
+              sinfo['active_slabs'], sinfo['num_slabs'],
+              sinfo['shared_avail']))
+
+
+def detect_kernel_config():
+    cfg = {}
+
+    cfg['nr_nodes'] = prog['nr_online_nodes'].value_()
+
+    if prog.type('struct kmem_cache').members[1][1] == 'flags':
+        cfg['allocator'] = 'SLUB'
+    elif prog.type('struct kmem_cache').members[1][1] == 'batchcount':
+        cfg['allocator'] = 'SLAB'
+    else:
+        err('Can\'t determine the slab allocator')
+
+    return cfg
+
+
+def main():
+    parser = argparse.ArgumentParser(description=DESC,
+                                     formatter_class=
+                                     argparse.RawTextHelpFormatter)
+    parser.add_argument('cgroup', metavar='CGROUP',
+                        help='Target memory cgroup')
+    args = parser.parse_args()
+
+    try:
+        cgroup_id = stat(args.cgroup).st_ino
+        find_memcg_ids()
+        memcg = MEMCGS[cgroup_id]
+    except KeyError:
+        err('Can\'t find the memory cgroup')
+
+    cfg = detect_kernel_config()
+
+    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
+          ' : tunables <limit> <batchcount> <sharedfactor>'
+          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
+
+    for s in list_for_each_entry('struct kmem_cache',
+                                 memcg.kmem_caches.address_of_(),
+                                 'memcg_params.kmem_caches_node'):
+        cache_show(s, cfg)
+
+
+main()
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj()
  2020-01-27 17:34 ` [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj() Roman Gushchin
@ 2020-02-03 16:05   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:32AM -0800, Roman Gushchin wrote:
> @@ -757,13 +757,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  
>  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
>  {
> -	struct page *page = virt_to_head_page(p);
> -	pg_data_t *pgdat = page_pgdat(page);
> +	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
>  	struct mem_cgroup *memcg;
>  	struct lruvec *lruvec;
>  
>  	rcu_read_lock();
> -	memcg = memcg_from_slab_page(page);
> +	memcg = mem_cgroup_from_obj(p);
>  
>  	/* Untracked pages have no memcg, no lruvec. Update only the node */
>  	if (!memcg || memcg == root_mem_cgroup) {

This function is specifically for slab objects. Why does it need the
indirection and additional branch here?

If memcg_from_slab_page() is going away later, I think the conversion
to this new helper should happen at that point in the series, not now.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations
  2020-01-27 17:34 ` [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations Roman Gushchin
@ 2020-02-03 16:12   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:33AM -0800, Roman Gushchin wrote:
> Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio
> the space for task stacks can be allocated using __vmalloc_node_range(),
> alloc_pages_node() and kmem_cache_alloc_node(). In the first and the
> second cases page->mem_cgroup pointer is set, but in the third it's
> not: memcg membership of a slab page should be determined using the
> memcg_from_slab_page() function, which looks at
> page->slab_cache->memcg_params.memcg . In this case, using
> mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
> page->mem_cgroup pointer is NULL even for pages charged to a non-root
> memory cgroup.
> 
> In order to fix it, let's introduce a mod_memcg_obj_state() helper,
> which takes a pointer to a kernel object as a first argument, uses
> mem_cgroup_from_obj() to get a RCU-protected memcg pointer and
> calls mod_memcg_state(). It allows to handle all possible
> configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE
> values) without spilling any memcg/kmem specifics into fork.c .

The change looks good to me, but it sounds like this is a bug with
actual consequences to userspace. Can you elaborate on that in the
changelog please? Maybe add a Fixes: line, if applicable?
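
(For readers following along without the patch open: from the
description above, the helper presumably boils down to something like
this -- my reconstruction from the changelog, not the actual hunk.)

        void mod_memcg_obj_state(void *p, int idx, int val)
        {
                struct mem_cgroup *memcg;

                rcu_read_lock();
                memcg = mem_cgroup_from_obj(p);
                if (memcg)
                        mod_memcg_state(memcg, idx, val);
                rcu_read_unlock();
        }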


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
  2020-01-27 17:34 ` [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state() Roman Gushchin
@ 2020-02-03 16:13   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:34AM -0800, Roman Gushchin wrote:
> Rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
> to unify it with mod_memcg_obj_state(). It better reflects the fact
> that the passed object isn't necessarily slab-backed.

Makes sense to me.

> @@ -1116,7 +1116,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>  	mod_node_page_state(page_pgdat(page), idx, val);
>  }
>  
> -static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
> +static inline void __mod_lruvec_obj_state(void *p, enum node_stat_item idx,
>  					   int val)
>  {
>  	struct page *page = virt_to_head_page(p);
> @@ -1217,12 +1217,12 @@ static inline void __dec_lruvec_page_state(struct page *page,
>  
>  static inline void __inc_lruvec_slab_state(void *p, enum node_stat_item idx)
>  {
> -	__mod_lruvec_slab_state(p, idx, 1);
> +	__mod_lruvec_obj_state(p, idx, 1);
>  }
>  
>  static inline void __dec_lruvec_slab_state(void *p, enum node_stat_item idx)
>  {
> -	__mod_lruvec_slab_state(p, idx, -1);
> +	__mod_lruvec_obj_state(p, idx, -1);
>  }

These should be renamed as well, no?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state()
  2020-01-27 17:34 ` [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state() Roman Gushchin
@ 2020-02-03 17:39   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:35AM -0800, Roman Gushchin wrote:
> To prepare for per-object accounting of slab objects, let's introduce
> __mod_lruvec_memcg_state() and mod_lruvec_memcg_state() helpers,
> which are similar to mod_lruvec_state(), but do not update global
> node counters, only lruvec and per-cgroup.
> 
> It's necessary because soon node slab counters will be used for
> accounting of all memory used by slab pages, however on memcg level
> only the actually used memory will be counted. The free space will be
> shared between all cgroups, so it can't be accounted to any
> specific cgroup.

Makes perfect sense. However, I think the existing mod_lruvec_state()
has a bad and misleading name, and adding to it in the same style
makes things worse.

Can we instead rename lruvec_state to node_memcg_state to capture that
it changes all levels, and then do the following clean API?

- node_state for node only

- memcg_state for memcg only

- lruvec_state for lruvec only

- node_memcg_state convenience wrapper to change node, memcg, lruvec counters

You can then open-code the disjunct node and memcg+lruvec counters.
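
Roughly, the entry points would look like this (just a sketch of the
proposed naming, with signatures borrowed from the current helpers):

        void __mod_node_state(pg_data_t *pgdat, enum node_stat_item idx,
                              int val);                 /* node only */
        void __mod_memcg_state(struct mem_cgroup *memcg, int idx,
                               int val);                /* memcg only */
        void __mod_lruvec_state(struct lruvec *lruvec,
                                enum node_stat_item idx,
                                int val);               /* lruvec only */

        /* convenience wrapper touching all three levels */
        static inline void __mod_node_memcg_state(struct lruvec *lruvec,
                                                  enum node_stat_item idx,
                                                  int val)
        {
                __mod_node_state(lruvec_pgdat(lruvec), idx, val);
                __mod_memcg_state(lruvec_memcg(lruvec), idx, val);
                __mod_lruvec_state(lruvec, idx, val);
        }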

[ Granted, lruvec counters are never modified on their own - always in
  conjunction with the memcg counters. And frankly, the only memcg
  counters that are modified *without* the lruvec counter-part are the
  special-case MEMCG_ counters.

  It would be nice to have 1) a completely separate API for the MEMCG_
  counters; and then 2) the node API for node and 3) a cgroup API for
  memcg+lruvec VM stat counters that allow you to easily do the
  disjunct accounting for slab memory.

  But I can't think of poignant names for these. At least nothing that
  would be better than separate memcg_state and lruvec_state calls. ]


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index()
  2020-01-27 17:34 ` [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
@ 2020-02-03 17:44   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:36AM -0800, Roman Gushchin wrote:
> This commit implements SLUB version of the obj_to_index() function,
> which will be required to calculate the offset of obj_cgroup in the
> obj_cgroups vector to store/obtain the objcg ownership data.
> 
> To make it faster, let's repeat the SLAB's trick introduced by
> commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> divide in obj_to_index()") and avoid an expensive division.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Christoph Lameter <cl@linux.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
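
(For context, the multiply trick being acked here is, reconstructed from
the changelog above, roughly the following; reciprocal_size is an
assumed name for a precomputed struct reciprocal_value in kmem_cache.)

        static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                                const struct page *page,
                                                void *obj)
        {
                /* offset-within-page / object size, done as a multiply */
                return reciprocal_divide((unsigned int)(obj - page_address(page)),
                                         cache->reciprocal_size);
        }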


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-01-27 17:34 ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
@ 2020-02-03 17:58   ` Johannes Weiner
  2020-02-03 18:25     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
> 
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.
> 
> This doesn't affect per-zone statistics. There are no plans to use
> zone-level byte-sized counters, so no reasons to change anything.

Wait, is this still necessary? AFAIU, the node counters will account
full slab pages, including free space, and only the memcg counters
that track actual objects will be in bytes.

Can you please elaborate?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 17:58   ` Johannes Weiner
@ 2020-02-03 18:25     ` Roman Gushchin
  2020-02-03 20:34       ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 18:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > Currently s8 type is used for per-cpu caching of per-node statistics.
> > It works fine because the overfill threshold can't exceed 125.
> > 
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> > 
> > This doesn't affect per-zone statistics. There are no plans to use
> > zone-level byte-sized counters, so no reasons to change anything.
> 
> Wait, is this still necessary? AFAIU, the node counters will account
> full slab pages, including free space, and only the memcg counters
> that track actual objects will be in bytes.
> 
> Can you please elaborate?

It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
being in different units depending on the accounting scope.
So I do convert all slab counters: global, per-lruvec,
and per-memcg to bytes.

Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
NR_SLAB_RECLAIMABLE_OBJ
NR_SLAB_UNRECLAIMABLE_OBJ
and keep global counters untouched. If going this way, I'd prefer to make
them per-memcg, because it will simplify things on charging paths:
now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
bump per-lruvec counters.


Btw, I wonder if we really need per-lruvec counters at all (at least
being enabled by default). For the significant number of users who
have a single-node machine they don't bring anything except performance
overhead. For those who have multiple nodes (and most likely many, many
memory cgroups) they provide way too much data, except for debugging
some weird mm issues.
I guess in the absolute majority of cases having global per-node + per-memcg
counters will be enough.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-01-27 17:34 ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
@ 2020-02-03 18:27   ` Johannes Weiner
  2020-02-03 18:34     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 18:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
> 
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/mm.h       | 25 ++++++++++++++++++--
>  include/linux/mm_types.h |  5 +++-
>  mm/memcontrol.c          |  5 ++--
>  mm/slab.c                |  3 ++-
>  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
>  mm/slub.c                |  2 +-
>  6 files changed, 83 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 080f8ac8bfb7..65224becc4ca 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
>  #ifdef CONFIG_MEMCG
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
> -	return page->mem_cgroup;
> +	struct mem_cgroup *memcg = page->mem_cgroup;
> +
> +	/*
> +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> +	 * but a obj_cgroups pointer. In this case the page is shared and
> +	 * isn't charged to any specific memory cgroup. Return NULL.
> +	 */
> +	if ((unsigned long) memcg & 0x1UL)
> +		memcg = NULL;
> +
> +	return memcg;

That should really WARN instead of silently returning NULL. Which
callsite optimistically asks a page's cgroup when it has no idea
whether that page is actually a userpage or not?

>  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>  {
> +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> +
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> -	return READ_ONCE(page->mem_cgroup);
> +
> +	/*
> +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> +	 * but a obj_cgroups pointer. In this case the page is shared and
> +	 * isn't charged to any specific memory cgroup. Return NULL.
> +	 */
> +	if ((unsigned long) memcg & 0x1UL)
> +		memcg = NULL;
> +
> +	return memcg;

Same here.

>  }
>  #else
>  static inline struct mem_cgroup *page_memcg(struct page *page)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 270aa8fd2800..5102f00f3336 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -198,7 +198,10 @@ struct page {
>  	atomic_t _refcount;
>  
>  #ifdef CONFIG_MEMCG
> -	struct mem_cgroup *mem_cgroup;
> +	union {
> +		struct mem_cgroup *mem_cgroup;
> +		struct obj_cgroup **obj_cgroups;
> +	};

Since you need the casts in both cases anyway, it's safer (and
simpler) to do

	unsigned long mem_cgroup;

to prevent accidental direct derefs in future code.

Otherwise, this patch looks good to me!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 18:27   ` Johannes Weiner
@ 2020-02-03 18:34     ` Roman Gushchin
  2020-02-03 20:46       ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 18:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > Allocate and release memory to store obj_cgroup pointers for each
> > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > to the allocated space.
> > 
> > To distinguish between obj_cgroups and memcg pointers in case
> > when it's not obvious which one is used (as in page_cgroup_ino()),
> > let's always set the lowest bit in the obj_cgroup case.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  include/linux/mm.h       | 25 ++++++++++++++++++--
> >  include/linux/mm_types.h |  5 +++-
> >  mm/memcontrol.c          |  5 ++--
> >  mm/slab.c                |  3 ++-
> >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> >  mm/slub.c                |  2 +-
> >  6 files changed, 83 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 080f8ac8bfb7..65224becc4ca 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> >  #ifdef CONFIG_MEMCG
> >  static inline struct mem_cgroup *page_memcg(struct page *page)
> >  {
> > -	return page->mem_cgroup;
> > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > +	 */
> > +	if ((unsigned long) memcg & 0x1UL)
> > +		memcg = NULL;
> > +
> > +	return memcg;
> 
> That should really WARN instead of silently returning NULL. Which
> callsite optimistically asks a page's cgroup when it has no idea
> whether that page is actually a userpage or not?

For instance, look at page_cgroup_ino(), which is called when
reading /proc/kpageflags.

> 
> >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> >  {
> > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > +
> >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return READ_ONCE(page->mem_cgroup);
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > +	 */
> > +	if ((unsigned long) memcg & 0x1UL)
> > +		memcg = NULL;
> > +
> > +	return memcg;
> 
> Same here.
> 
> >  }
> >  #else
> >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 270aa8fd2800..5102f00f3336 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -198,7 +198,10 @@ struct page {
> >  	atomic_t _refcount;
> >  
> >  #ifdef CONFIG_MEMCG
> > -	struct mem_cgroup *mem_cgroup;
> > +	union {
> > +		struct mem_cgroup *mem_cgroup;
> > +		struct obj_cgroup **obj_cgroups;
> > +	};
> 
> Since you need the casts in both cases anyway, it's safer (and
> simpler) to do
> 
> 	unsigned long mem_cgroup;
> 
> to prevent accidental direct derefs in future code.

Agree. Maybe even mem_cgroup_data?
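
Something like this, roughly (the field and macro names below are made
up just to illustrate the tagged-pointer accessors, they are not from
the patch):

        #define MEMCG_DATA_OBJCGS       0x1UL

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                unsigned long data = page->mem_cgroup_data;

                /* low bit set: obj_cgroups vector, not a memcg pointer */
                if (data & MEMCG_DATA_OBJCGS)
                        return NULL;
                return (struct mem_cgroup *)data;
        }

        static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
        {
                unsigned long data = page->mem_cgroup_data;

                if (!(data & MEMCG_DATA_OBJCGS))
                        return NULL;
                return (struct obj_cgroup **)(data & ~MEMCG_DATA_OBJCGS);
        }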

> 
> Otherwise, this patch looks good to me!

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API
  2020-01-27 17:34 ` [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API Roman Gushchin
@ 2020-02-03 19:31   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:40AM -0800, Roman Gushchin wrote:
> Obj_cgroup API provides an ability to account sub-page sized kernel
> objects, which potentially outlive the original memory cgroup.
> 
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
> 
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> 
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> 
> Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> reference counter. It substitutes a memory cgroup in places where
> it's necessary to charge a custom amount of bytes instead of pages.
> 
> All charged memory rounded down to pages is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
> 
> It implements reparenting: on memcg offlining it's getting reattached
> to the parent memory cgroup. Each online memory cgroup has an
> associated active object cgroup to handle new allocations and the list
> of all attached object cgroups. On offlining of a cgroup this list is
> reparented and for each object cgroup in the list the memcg pointer is
> swapped to the parent memory cgroup. It prevents long-lived objects
> from pinning the original memory cgroup in memory.
> 
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is a part of
> obj_cgroup object. So on cgroup offlining the leftover is automatically
> reparented.
> 
> memcg->objcg is rcu protected.
> objcg->memcg is a raw pointer, which is always pointing at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
>  	struct wb_completion done;	/* tracks in-flight foreign writebacks */
>  };
>  
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +	struct percpu_ref refcnt;
> +	struct mem_cgroup *memcg;
> +	atomic_t nr_charged_bytes;
> +	union {
> +		struct list_head list;
> +		struct rcu_head rcu;
> +	};
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -306,6 +323,8 @@ struct mem_cgroup {
>  	int kmemcg_id;
>  	enum memcg_kmem_state kmem_state;
>  	struct list_head kmem_caches;
> +	struct obj_cgroup __rcu *objcg;
> +	struct list_head objcg_list;

These could use a comment, IMO.

	/*
	 * Active object acounting bucket, as well as
	 * reparented buckets from dead children with
	 * outstanding objects.
	 */

or something like that.

> @@ -257,6 +257,73 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  }
>  
>  #ifdef CONFIG_MEMCG_KMEM
> +extern spinlock_t css_set_lock;
> +
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	unsigned int nr_bytes;
> +	unsigned int nr_pages;
> +	unsigned long flags;
> +
> +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> +	nr_pages = nr_bytes >> PAGE_SHIFT;
> +
> +	if (nr_pages) {
> +		rcu_read_lock();
> +		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +		rcu_read_unlock();
> +	}
> +
> +	spin_lock_irqsave(&css_set_lock, flags);
> +	list_del(&objcg->list);
> +	mem_cgroup_put(obj_cgroup_memcg(objcg));
> +	spin_unlock_irqrestore(&css_set_lock, flags);

Heh, two obj_cgroup_memcg() lookups with different synchronization
rules.

I know that reparenting could happen in between the page uncharge and
the mem_cgroup_put(), and it would still be safe because the counters
are migrated atomically. But it seems needlessly lockless and complex.

Since you have to css_set_lock anyway, wouldn't it be better to do

	spin_lock_irqsave(&css_set_lock, flags);
	memcg = obj_cgroup_memcg(objcg);
	if (nr_pages)
		__memcg_kmem_uncharge(memcg, nr_pages);
	list_del(&objcg->list);
	mem_cgroup_put(memcg);
	spin_unlock_irqrestore(&css_set_lock, flags);

instead?

> +	percpu_ref_exit(ref);
> +	kfree_rcu(objcg, rcu);
> +}
> +
> +static struct obj_cgroup *obj_cgroup_alloc(void)
> +{
> +	struct obj_cgroup *objcg;
> +	int ret;
> +
> +	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
> +	if (!objcg)
> +		return NULL;
> +
> +	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
> +			      GFP_KERNEL);
> +	if (ret) {
> +		kfree(objcg);
> +		return NULL;
> +	}
> +	INIT_LIST_HEAD(&objcg->list);
> +	return objcg;
> +}
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> +				  struct mem_cgroup *parent)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

Can this actually race with new charges? By the time we are going
offline, where would they be coming from?

What happens if the charger sees a live memcg, but its memcg->objcg is
cleared? Shouldn't they have the same kind of lifetime, where as long
as the memcg can be charged, so can the objcg? What would happen if
you didn't clear memcg->objcg here?

> +	/* Paired with mem_cgroup_put() in objcg_release(). */
> +	css_get(&memcg->css);
> +	percpu_ref_kill(&objcg->refcnt);
> +
> +	spin_lock_irq(&css_set_lock);
> +	list_for_each_entry(objcg, &memcg->objcg_list, list) {
> +		css_get(&parent->css);
> +		xchg(&objcg->memcg, parent);
> +		css_put(&memcg->css);
> +	}

I'm having a pretty hard time following this refcounting.

Why does objcg only acquire a css reference on the way out? It should
hold one when objcg->memcg is set up, and put it when that pointer
goes away.

But also, objcg is already on its own memcg->objcg_list from the
start, so on the first reparenting we get a css ref, then move it to
the parent, then obj_cgroup_release() puts one it doesn't have ...?

Argh, help.
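
For reference, a minimal sketch of the scheme being asked for here: the
objcg pins its memcg from creation, the pin moves to the parent on
reparenting, and is dropped in the release path. Names are taken from
the quoted patch; the exact code is an assumption, not part of the
series:

	/* online: the objcg pins its memcg from the start */
	objcg->memcg = memcg;
	css_get(&memcg->css);
	list_add(&objcg->list, &memcg->objcg_list);

	/* reparent: move each objcg's css reference to the parent */
	spin_lock_irq(&css_set_lock);
	list_for_each_entry(objcg, &memcg->objcg_list, list) {
		css_get(&parent->css);
		xchg(&objcg->memcg, parent);
		css_put(&memcg->css);
	}
	list_splice_init(&memcg->objcg_list, &parent->objcg_list);
	spin_unlock_irq(&css_set_lock);

	/* release: drop whatever css reference the objcg holds now */
	css_put(&obj_cgroup_memcg(objcg)->css);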

> @@ -2978,6 +3070,120 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	if (PageKmemcg(page))
>  		__ClearPageKmemcg(page);
>  }
> +
> +static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +	bool ret = false;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> +		stock->nr_bytes -= nr_bytes;
> +		ret = true;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return ret;
> +}
> +
> +static void drain_obj_stock(struct memcg_stock_pcp *stock)
> +{
> +	struct obj_cgroup *old = stock->cached_objcg;
> +
> +	if (!old)
> +		return;
> +
> +	if (stock->nr_bytes) {
> +		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> +		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
> +
> +		if (nr_pages) {
> +			rcu_read_lock();
> +			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> +			rcu_read_unlock();
> +		}
> +
> +		atomic_add(nr_bytes, &old->nr_charged_bytes);
> +		stock->nr_bytes = 0;
> +	}
> +
> +	obj_cgroup_put(old);
> +	stock->cached_objcg = NULL;
> +}
> +
> +static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> +				     struct mem_cgroup *root_memcg)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (stock->cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->cached_objcg);
> +		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (stock->cached_objcg != objcg) { /* reset if necessary */
> +		drain_obj_stock(stock);
> +		obj_cgroup_get(objcg);
> +		stock->cached_objcg = objcg;
> +		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> +	}
> +	stock->nr_bytes += nr_bytes;
> +
> +	if (stock->nr_bytes > PAGE_SIZE)
> +		drain_obj_stock(stock);
> +
> +	local_irq_restore(flags);
> +}
> +
> +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned int nr_pages, nr_bytes;
> +	int ret;
> +
> +	if (consume_obj_stock(objcg, size))
> +		return 0;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	css_get(&memcg->css);
> +	rcu_read_unlock();

I don't quite understand the lifetime rules here. You're holding the
rcu lock, so the memcg object cannot get physically freed while you
are looking it up. But you could be racing with an offlining and see
the stale memcg pointer. Isn't css_get() unsafe? Doesn't this need a
retry loop around css_tryget() similar to get_mem_cgroup_from_mm()?
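
The pattern referred to is roughly the following (a sketch modeled on
get_mem_cgroup_from_mm(); the exact placement is an assumption):

	rcu_read_lock();
	do {
		memcg = obj_cgroup_memcg(objcg);
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();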

> +
> +	nr_pages = size >> PAGE_SHIFT;
> +	nr_bytes = size & (PAGE_SIZE - 1);
> +
> +	if (nr_bytes)
> +		nr_pages += 1;
> +
> +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +	if (!ret && nr_bytes)
> +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> +
> +	css_put(&memcg->css);
> +	return ret;
> +}
> +
> +void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> +{
> +	refill_obj_stock(objcg, size);
> +}
> +
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -3400,7 +3606,8 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	int memcg_id;
> +	struct obj_cgroup *objcg;
> +	int memcg_id;
>  
>  	if (cgroup_memory_nokmem)
>  		return 0;
> @@ -3412,6 +3619,15 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
>  	if (memcg_id < 0)
>  		return memcg_id;
>  
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg) {
> +		memcg_free_cache_id(memcg_id);
> +		return -ENOMEM;
> +	}
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +	list_add(&objcg->list, &memcg->objcg_list);

This self-hosting significantly adds to my confusion. It'd be a lot
easier to understand ownership rules and references if this list_add()
was done directly to the parent's list at the time of reparenting, not
here.

If the objcg holds a css reference, right here is where it should be
acquired. Then transferred in reparent and put during release.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-01-27 17:34 ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Roman Gushchin
@ 2020-02-03 19:50   ` Johannes Weiner
  2020-02-03 20:58     ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> This is fairly big but mostly red patch, which makes all non-root
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
> 
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.

This is definitely going in the right direction. But it doesn't quite
explain why we still need two sets of kmem_caches?

In the old scheme, we had completely separate per-cgroup caches with
separate slab pages. If a cgrouped process wanted to allocate a slab
object, we'd go to the root cache and used the cgroup id to look up
the right cgroup cache. On slab free we'd use page->slab_cache.

Now we have slab pages that have a page->objcg array. Why can't all
allocations go through a single set of kmem caches? If an allocation
is coming from a cgroup and the slab page the allocator wants to use
doesn't have an objcg array yet, we can allocate it on the fly, no?
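
A sketch of what "on the fly" could look like in the charge path; the
helper names below are made up for illustration, not existing API:

	/* make sure this slab page can record per-object ownership */
	if (objcg && !page_obj_cgroups(page)) {
		if (alloc_page_obj_cgroups(page, cache, gfp))
			return NULL;	/* or fall back to an uncharged allocation */
	}
	page_obj_cgroups(page)[obj_index] = objcg;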


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects
  2020-01-27 17:34 ` [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
@ 2020-02-03 19:53   ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:42AM -0800, Roman Gushchin wrote:
> Store the obj_cgroup pointer in the corresponding place of
> page->obj_cgroups for each allocated non-root slab object.
> Make sure that each allocated object holds a reference to obj_cgroup.
> 
> Objcg pointer is obtained from the memcg->objcg dereferencing
> in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> post_alloc_hook. Then in case of successful allocation(s) it's
> getting stored in the page->obj_cgroups vector.
> 
> The objcg obtaining part looks a bit bulky now, but it will be simplified
> by the next commits in the series.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h |  3 +-
>  mm/memcontrol.c            | 14 +++++++--
>  mm/slab.c                  | 18 +++++++-----
>  mm/slab.h                  | 60 ++++++++++++++++++++++++++++++++++----
>  mm/slub.c                  | 14 +++++----
>  5 files changed, 88 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 30bbea3f85e2..54bfb26b5016 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1431,7 +1431,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  }
>  #endif
>  
> -struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
> +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
> +					struct obj_cgroup **objcgp);
>  void memcg_kmem_put_cache(struct kmem_cache *cachep);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94337ab1ebe9..0e9fe272e688 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2896,7 +2896,8 @@ static inline bool memcg_kmem_bypass(void)
>   * done with it, memcg_kmem_put_cache() must be called to release the
>   * reference.
>   */
> -struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
> +					struct obj_cgroup **objcgp)
>  {
>  	struct mem_cgroup *memcg;
>  	struct kmem_cache *memcg_cachep;
> @@ -2952,8 +2953,17 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>  	 */
>  	if (unlikely(!memcg_cachep))
>  		memcg_schedule_kmem_cache_create(memcg, cachep);
> -	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
> +	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
> +		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
> +
> +		if (!objcg || !obj_cgroup_tryget(objcg)) {
> +			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
> +			goto out_unlock;
> +		}

As per the reply to the previous patch: I don't understand why the
objcg requires a pulse check here. As long as the memcg is alive and
can be charged with memory, how can the objcg disappear?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 18:25     ` Roman Gushchin
@ 2020-02-03 20:34       ` Johannes Weiner
  2020-02-03 22:28         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 20:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > It works fine because the overfill threshold can't exceed 125.
> > > 
> > > But if some counters are in bytes (and the next commit in the series
> > > will convert slab counters to bytes), it's not gonna work:
> > > value in bytes can easily exceed s8 without exceeding the threshold
> > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > vmstats correctness, let's use s32 instead.
> > > 
> > > This doesn't affect per-zone statistics. There are no plans to use
> > > zone-level byte-sized counters, so no reasons to change anything.
> > 
> > Wait, is this still necessary? AFAIU, the node counters will account
> > full slab pages, including free space, and only the memcg counters
> > that track actual objects will be in bytes.
> > 
> > Can you please elaborate?
> 
> It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> being in different units depending on the accounting scope.
> So I do convert all slab counters: global, per-lruvec,
> and per-memcg to bytes.

Since the node counters tracks allocated slab pages and the memcg
counter tracks allocated objects, arguably they shouldn't use the same
name anyway.

> Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> NR_SLAB_RECLAIMABLE_OBJ
> NR_SLAB_UNRECLAIMABLE_OBJ

Can we alias them and reuse their slots?

	/* Reuse the node slab page counters item for charged objects */
	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,

> and keep global counters untouched. If going this way, I'd prefer to make
> them per-memcg, because it will simplify things on charging paths:
> now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> bump per-lruvec counters.

I don't quite follow. Don't you still have to update the global
counters?

> Btw, I wonder if we really need per-lruvec counters at all (at least
> being enabled by default). For the significant amount of users who
> have a single-node machine it doesn't bring anything except performance
> overhead.

Yeah, for single-node systems we should be able to redirect everything
to the memcg counters, without allocating and tracking lruvec copies.

> For those who have multiple nodes (and most likely many many
> memory cgroups) it provides way too many data except for debugging
> some weird mm issues.
> I guess in the absolute majority of cases having global per-node + per-memcg
> counters will be enough.

Hm? Reclaim uses the lruvec counters.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 18:34     ` Roman Gushchin
@ 2020-02-03 20:46       ` Johannes Weiner
  2020-02-03 21:19         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 20:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > Allocate and release memory to store obj_cgroup pointers for each
> > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > to the allocated space.
> > > 
> > > To distinguish between obj_cgroups and memcg pointers in case
> > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > let's always set the lowest bit in the obj_cgroup case.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > >  include/linux/mm_types.h |  5 +++-
> > >  mm/memcontrol.c          |  5 ++--
> > >  mm/slab.c                |  3 ++-
> > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > >  mm/slub.c                |  2 +-
> > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 080f8ac8bfb7..65224becc4ca 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > >  #ifdef CONFIG_MEMCG
> > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > >  {
> > > -	return page->mem_cgroup;
> > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > +
> > > +	/*
> > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > +	 */
> > > +	if ((unsigned long) memcg & 0x1UL)
> > > +		memcg = NULL;
> > > +
> > > +	return memcg;
> > 
> > That should really WARN instead of silently returning NULL. Which
> > callsite optimistically asks a page's cgroup when it has no idea
> > whether that page is actually a userpage or not?
> 
> For instance, look at page_cgroup_ino() called from the
> reading /proc/kpageflags.

But that checks PageSlab() and implements memcg_from_slab_page() to
handle that case properly. And that's what we expect all callsites to
do: make sure that the question asked actually makes sense, instead of
having the interface paper over bogus requests.

If that function is completely racy and PageSlab isn't stable, then it
should really just open-code the lookup, rather than require weakening
the interface for everybody else.
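
In other words, something along these lines (a sketch reusing the
low-bit tagging from the quoted patch):

	static inline struct mem_cgroup *page_memcg(struct page *page)
	{
		unsigned long v = (unsigned long)page->mem_cgroup;

		/* Asking for "the" memcg of a slab page is a caller bug. */
		if (WARN_ON_ONCE(v & 0x1UL))
			return NULL;

		return (struct mem_cgroup *)v;
	}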

> > >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > >  {
> > > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > > +
> > >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > > -	return READ_ONCE(page->mem_cgroup);
> > > +
> > > +	/*
> > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > +	 */
> > > +	if ((unsigned long) memcg & 0x1UL)
> > > +		memcg = NULL;
> > > +
> > > +	return memcg;
> > 
> > Same here.
> > 
> > >  }
> > >  #else
> > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 270aa8fd2800..5102f00f3336 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -198,7 +198,10 @@ struct page {
> > >  	atomic_t _refcount;
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > -	struct mem_cgroup *mem_cgroup;
> > > +	union {
> > > +		struct mem_cgroup *mem_cgroup;
> > > +		struct obj_cgroup **obj_cgroups;
> > > +	};
> > 
> > Since you need the casts in both cases anyway, it's safer (and
> > simpler) to do
> > 
> > 	unsigned long mem_cgroup;
> > 
> > to prevent accidental direct derefs in future code.
> 
> Agree. Maybe even mem_cgroup_data?

Personally, I don't think the suffix adds much. The type makes it so
the compiler catches any accidental use, and access is very
centralized so greppability doesn't matter much.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 19:50   ` Johannes Weiner
@ 2020-02-03 20:58     ` Roman Gushchin
  2020-02-03 22:17       ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 20:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > This is fairly big but mostly red patch, which makes all non-root
> > slab allocations use a single set of kmem_caches instead of
> > creating a separate set for each memory cgroup.
> > 
> > Because the number of non-root kmem_caches is now capped by the number
> > of root kmem_caches, there is no need to shrink or destroy them
> > prematurely. They can be perfectly destroyed together with their
> > root counterparts. This allows to dramatically simplify the
> > management of non-root kmem_caches and delete a ton of code.
> 
> This is definitely going in the right direction. But it doesn't quite
> explain why we still need two sets of kmem_caches?
> 
> In the old scheme, we had completely separate per-cgroup caches with
> separate slab pages. If a cgrouped process wanted to allocate a slab
> object, we'd go to the root cache and used the cgroup id to look up
> the right cgroup cache. On slab free we'd use page->slab_cache.
> 
> Now we have slab pages that have a page->objcg array. Why can't all
> allocations go through a single set of kmem caches? If an allocation
> is coming from a cgroup and the slab page the allocator wants to use
> doesn't have an objcg array yet, we can allocate it on the fly, no?

Well, arguably it can be done, but there are a few drawbacks:

1) On the release path you'll need to make some extra work even for
   root allocations: calculate the offset only to find the NULL objcg pointer.

2) There will be a memory overhead for root allocations
   (which might or might not be compensated by the increase
   of the slab utilization).

3) I'm working on percpu memory accounting that resembles the same scheme,
   except that obj_cgroups vector is created for the whole percpu block.
   There will be root- and memcg-blocks, and it will be expensive to merge them.
   I kinda like using the same scheme here and there.

Upsides?

1) slab utilization might increase a little bit (but I doubt it will have
   a huge effect, because both merging sets should be relatively big and well
   utilized)
2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
   but there isn't so much code left anyway.


So IMO it's an interesting direction to explore, but not something
that necessarily has to be done in the context of this patchset.
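
On point 1 above: in a single-set scheme the free path for every object,
root or not, would have to do something like the sketch below (with
obj_to_index() and page_obj_cgroups() as assumed helpers) just to find a
NULL slot for root allocations:

	off = obj_to_index(cache, page, object);
	objcg = page_obj_cgroups(page) ? page_obj_cgroups(page)[off] : NULL;
	if (objcg) {
		obj_cgroup_uncharge(objcg, obj_size);
		obj_cgroup_put(objcg);
		page_obj_cgroups(page)[off] = NULL;
	}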


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 20:46       ` Johannes Weiner
@ 2020-02-03 21:19         ` Roman Gushchin
  2020-02-03 22:29           ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 21:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 03:46:27PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > > Allocate and release memory to store obj_cgroup pointers for each
> > > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > > to the allocated space.
> > > > 
> > > > To distinguish between obj_cgroups and memcg pointers in case
> > > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > > let's always set the lowest bit in the obj_cgroup case.
> > > > 
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > ---
> > > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > > >  include/linux/mm_types.h |  5 +++-
> > > >  mm/memcontrol.c          |  5 ++--
> > > >  mm/slab.c                |  3 ++-
> > > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > > >  mm/slub.c                |  2 +-
> > > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 080f8ac8bfb7..65224becc4ca 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > > >  #ifdef CONFIG_MEMCG
> > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > >  {
> > > > -	return page->mem_cgroup;
> > > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > > +
> > > > +	/*
> > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > +	 */
> > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > +		memcg = NULL;
> > > > +
> > > > +	return memcg;
> > > 
> > > That should really WARN instead of silently returning NULL. Which
> > > callsite optimistically asks a page's cgroup when it has no idea
> > > whether that page is actually a userpage or not?
> > 
> > For instance, look at page_cgroup_ino() called from the
> > reading /proc/kpageflags.
> 
> But that checks PageSlab() and implements memcg_from_slab_page() to
> handle that case properly. And that's what we expect all callsites to
> do: make sure that the question asked actually makes sense, instead of
> having the interface paper over bogus requests.
> 
> If that function is completely racy and PageSlab isn't stable, then it
> should really just open-code the lookup, rather than require weakening
> the interface for everybody else.

Why though?

Another example: depending on the machine config and platform, a process
stack can be a vmalloc allocation, a slab allocation, or a "high-order slab
allocation" served directly by the page allocator.

It's kinda nice to have a function that hides accounting details
and returns a valid memcg pointer for any kind of objects.

To me it seems to be a valid question:
for a given kernel object give me a pointer to the memory cgroup.

Why is it weakening?

Moreover, open-coding this lookup leads to bugs like the one fixed by
ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes").
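
Roughly what such a helper could look like; the name and the slab branch
below are assumptions for illustration, not something from this series
(and a complete version would also need to handle vmalloc-backed
objects):

	struct mem_cgroup *memcg_from_obj(void *p)
	{
		struct page *page = virt_to_head_page(p);

		/* slab pages carry a per-object obj_cgroup vector, not a memcg */
		if (PageSlab(page))
			return memcg_from_slab_obj(page, p);	/* hypothetical */

		return page_memcg(page);
	}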

> 
> > > >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > > >  {
> > > > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > > > +
> > > >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > > > -	return READ_ONCE(page->mem_cgroup);
> > > > +
> > > > +	/*
> > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > +	 */
> > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > +		memcg = NULL;
> > > > +
> > > > +	return memcg;
> > > 
> > > Same here.
> > > 
> > > >  }
> > > >  #else
> > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 270aa8fd2800..5102f00f3336 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -198,7 +198,10 @@ struct page {
> > > >  	atomic_t _refcount;
> > > >  
> > > >  #ifdef CONFIG_MEMCG
> > > > -	struct mem_cgroup *mem_cgroup;
> > > > +	union {
> > > > +		struct mem_cgroup *mem_cgroup;
> > > > +		struct obj_cgroup **obj_cgroups;
> > > > +	};
> > > 
> > > Since you need the casts in both cases anyway, it's safer (and
> > > simpler) to do
> > > 
> > > 	unsigned long mem_cgroup;
> > > 
> > > to prevent accidental direct derefs in future code.
> > 
> > Agree. Maybe even mem_cgroup_data?
> 
> Personally, I don't think the suffix adds much. The type makes it so
> the compiler catches any accidental use, and access is very
> centralized so greppability doesn't matter much.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 20:58     ` Roman Gushchin
@ 2020-02-03 22:17       ` Johannes Weiner
  2020-02-03 22:38         ` Roman Gushchin
  2020-02-04  1:15         ` Roman Gushchin
  0 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > This is fairly big but mostly red patch, which makes all non-root
> > > slab allocations use a single set of kmem_caches instead of
> > > creating a separate set for each memory cgroup.
> > > 
> > > Because the number of non-root kmem_caches is now capped by the number
> > > of root kmem_caches, there is no need to shrink or destroy them
> > > prematurely. They can be perfectly destroyed together with their
> > > root counterparts. This allows to dramatically simplify the
> > > management of non-root kmem_caches and delete a ton of code.
> > 
> > This is definitely going in the right direction. But it doesn't quite
> > explain why we still need two sets of kmem_caches?
> > 
> > In the old scheme, we had completely separate per-cgroup caches with
> > separate slab pages. If a cgrouped process wanted to allocate a slab
> > object, we'd go to the root cache and used the cgroup id to look up
> > the right cgroup cache. On slab free we'd use page->slab_cache.
> > 
> > Now we have slab pages that have a page->objcg array. Why can't all
> > allocations go through a single set of kmem caches? If an allocation
> > is coming from a cgroup and the slab page the allocator wants to use
> > doesn't have an objcg array yet, we can allocate it on the fly, no?
> 
> > Well, arguably it can be done, but there are a few drawbacks:
> 
> 1) On the release path you'll need to make some extra work even for
>    root allocations: calculate the offset only to find the NULL objcg pointer.
> 
> 2) There will be a memory overhead for root allocations
>    (which might or might not be compensated by the increase
>    of the slab utilization).

Those two are only true if there is a wild mix of root and cgroup
allocations inside the same slab, and that doesn't really happen in
practice. Either the machine is dedicated to one workload and cgroups
are only enabled due to e.g. a vendor kernel, or you have cgrouped
systems (like most distro systems now) that cgroup everything.

> 3) I'm working on percpu memory accounting that resembles the same scheme,
>    except that obj_cgroups vector is created for the whole percpu block.
>    There will be root- and memcg-blocks, and it will be expensive to merge them.
>    I kinda like using the same scheme here and there.

It's hard to conclude anything based on this information alone. If
it's truly expensive to merge them, then it warrants the additional
complexity. But I don't understand the desire to share a design for
two systems with sufficiently different constraints.

> Upsides?
> 
> 1) slab utilization might increase a little bit (but I doubt it will have
>    a huge effect, because both merging sets should be relatively big and well
>    utilized)

Right.

> 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
>    but there isn't so much code left anyway.

There is a lot of complexity associated with the cache cloning that
isn't the lines of code, but the lifetime and synchronization rules.

And these two things are the primary aspects that make my head hurt
trying to review this patch series.

> So IMO it's an interesting direction to explore, but not something
> that necessarily has to be done in the context of this patchset.

I disagree. Instead of replacing the old coherent model and its
complexities with a new coherent one, you are mixing the two. And I
can barely understand the end result.

Dynamically cloning entire slab caches for the sole purpose of telling
whether the pages have an obj_cgroup array or not is *completely
insane*. If the controller had followed the obj_cgroup design from the
start, nobody would have ever thought about doing it like this.

From a maintainability POV, we cannot afford merging it in this form.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 20:34       ` Johannes Weiner
@ 2020-02-03 22:28         ` Roman Gushchin
  2020-02-03 22:39           ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 22:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > It works fine because the overfill threshold can't exceed 125.
> > > > 
> > > > But if some counters are in bytes (and the next commit in the series
> > > > will convert slab counters to bytes), it's not gonna work:
> > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > vmstats correctness, let's use s32 instead.
> > > > 
> > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > zone-level byte-sized counters, so no reasons to change anything.
> > > 
> > > Wait, is this still necessary? AFAIU, the node counters will account
> > > full slab pages, including free space, and only the memcg counters
> > > that track actual objects will be in bytes.
> > > 
> > > Can you please elaborate?
> > 
> > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > being in different units depending on the accounting scope.
> > So I do convert all slab counters: global, per-lruvec,
> > and per-memcg to bytes.
> 
> Since the node counters tracks allocated slab pages and the memcg
> counter tracks allocated objects, arguably they shouldn't use the same
> name anyway.
> 
> > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > NR_SLAB_RECLAIMABLE_OBJ
> > NR_SLAB_UNRECLAIMABLE_OBJ
> 
> Can we alias them and reuse their slots?
> 
> 	/* Reuse the node slab page counters item for charged objects */
> 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,

Yeah, lgtm.

Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
the enum memcg_stat_item?

> 
> > and keep global counters untouched. If going this way, I'd prefer to make
> > them per-memcg, because it will simplify things on charging paths:
> > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > bump per-lruvec counters.
> 
> I don't quite follow. Don't you still have to update the global
> counters?

Global counters are updated only if an allocation requires a new slab
page, which isn't the most common path.
In generic case post_hook is required because it's the only place where
we have both page (to get the node) and memcg pointer.

If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
then post_hook can handle only the rare "allocation failed" case.

I'm not sure here what's better.

> 
> > Btw, I wonder if we really need per-lruvec counters at all (at least
> > being enabled by default). For the significant amount of users who
> > have a single-node machine it doesn't bring anything except performance
> > overhead.
> 
> Yeah, for single-node systems we should be able to redirect everything
> to the memcg counters, without allocating and tracking lruvec copies.

Sounds good. It can lead to significant savings on single-node machines.

> 
> > For those who have multiple nodes (and most likely many many
> > memory cgroups) it provides way too many data except for debugging
> > some weird mm issues.
> > I guess in the absolute majority of cases having global per-node + per-memcg
> > counters will be enough.
> 
> Hm? Reclaim uses the lruvec counters.

Can you, please, provide some examples? It looks like it's mostly based
on per-zone lruvec size counters.

Anyway, it seems to be a little bit off from this patchset, so let's
discuss it separately.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 21:19         ` Roman Gushchin
@ 2020-02-03 22:29           ` Johannes Weiner
  0 siblings, 0 replies; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 01:19:15PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 03:46:27PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > > > Allocate and release memory to store obj_cgroup pointers for each
> > > > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > > > to the allocated space.
> > > > > 
> > > > > To distinguish between obj_cgroups and memcg pointers in case
> > > > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > > > let's always set the lowest bit in the obj_cgroup case.
> > > > > 
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > ---
> > > > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > > > >  include/linux/mm_types.h |  5 +++-
> > > > >  mm/memcontrol.c          |  5 ++--
> > > > >  mm/slab.c                |  3 ++-
> > > > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > > > >  mm/slub.c                |  2 +-
> > > > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 080f8ac8bfb7..65224becc4ca 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > > > >  #ifdef CONFIG_MEMCG
> > > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > > >  {
> > > > > -	return page->mem_cgroup;
> > > > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > > > +
> > > > > +	/*
> > > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > > +	 */
> > > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > > +		memcg = NULL;
> > > > > +
> > > > > +	return memcg;
> > > > 
> > > > That should really WARN instead of silently returning NULL. Which
> > > > callsite optimistically asks a page's cgroup when it has no idea
> > > > whether that page is actually a userpage or not?
> > > 
> > > For instance, look at page_cgroup_ino() called from the
> > > reading /proc/kpageflags.
> > 
> > But that checks PageSlab() and implements memcg_from_slab_page() to
> > handle that case properly. And that's what we expect all callsites to
> > do: make sure that the question asked actually makes sense, instead of
> > having the interface paper over bogus requests.
> > 
> > If that function is completely racy and PageSlab isn't stable, then it
> > should really just open-code the lookup, rather than require weakening
> > the interface for everybody else.
> 
> Why though?
> 
> Another example: depending on the machine config and platform, a process
> stack can be a vmalloc allocation, a slab allocation, or a "high-order slab
> allocation" served directly by the page allocator.
> 
> It's kinda nice to have a function that hides accounting details
> and returns a valid memcg pointer for any kind of objects.

Hm? I'm not objecting to that, memcg_from_obj() makes perfect sense to
me, to use with kvmalloc() objects for example.

I'm objecting to page_memcg() silently swallowing bogus inputs. That
function shouldn't silently say "there is no cgroup associated with
this page" when the true answer is "this page has MANY cgroups
associated with it, this question doesn't make any sense".

It's not exactly hard to imagine how this could cause bugs, is it?
Where a caller should implement a slab case (exactly like
page_cgroup_ino()) but is confused about the type of page it has,
whether it's charged or not etc.?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 22:17       ` Johannes Weiner
@ 2020-02-03 22:38         ` Roman Gushchin
  2020-02-04  1:15         ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-02-03 22:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are a few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.
> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.
> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

Having two sets of kmem_caches has nothing to do with the refcounting
and obj_cgroup abstraction.
Please, take a look at the final code.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 22:28         ` Roman Gushchin
@ 2020-02-03 22:39           ` Johannes Weiner
  2020-02-04  1:44             ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > > It works fine because the overfill threshold can't exceed 125.
> > > > > 
> > > > > But if some counters are in bytes (and the next commit in the series
> > > > > will convert slab counters to bytes), it's not gonna work:
> > > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > > vmstats correctness, let's use s32 instead.
> > > > > 
> > > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > > zone-level byte-sized counters, so no reasons to change anything.
> > > > 
> > > > Wait, is this still necessary? AFAIU, the node counters will account
> > > > full slab pages, including free space, and only the memcg counters
> > > > that track actual objects will be in bytes.
> > > > 
> > > > Can you please elaborate?
> > > 
> > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > > being in different units depending on the accounting scope.
> > > So I do convert all slab counters: global, per-lruvec,
> > > and per-memcg to bytes.
> > 
> > Since the node counters tracks allocated slab pages and the memcg
> > counter tracks allocated objects, arguably they shouldn't use the same
> > name anyway.
> > 
> > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > > NR_SLAB_RECLAIMABLE_OBJ
> > > NR_SLAB_UNRECLAIMABLE_OBJ
> > 
> > Can we alias them and reuse their slots?
> > 
> > 	/* Reuse the node slab page counters item for charged objects */
> > 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> > 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
> 
> Yeah, lgtm.
> 
> Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
> the enum memcg_stat_item?

Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps?

> > > and keep global counters untouched. If going this way, I'd prefer to make
> > > them per-memcg, because it will simplify things on charging paths:
> > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > > bump per-lruvec counters.
> > 
> > I don't quite follow. Don't you still have to update the global
> > counters?
> 
> Global counters are updated only if an allocation requires a new slab
> page, which isn't the most common path.

Right.

> In generic case post_hook is required because it's the only place where
> we have both page (to get the node) and memcg pointer.
> 
> If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
> then post_hook can handle only the rare "allocation failed" case.
> 
> I'm not sure here what's better.

If it's tracked only per-memcg, you still have to account it every
time you charge an object to a memcg, no? How is it less frequent than
accounting at the lruvec level?

> > > Btw, I wonder if we really need per-lruvec counters at all (at least
> > > being enabled by default). For the significant amount of users who
> > > have a single-node machine it doesn't bring anything except performance
> > > overhead.
> > 
> > Yeah, for single-node systems we should be able to redirect everything
> > to the memcg counters, without allocating and tracking lruvec copies.
> 
> Sounds good. It can lead to significant savings on single-node machines.
> 
> > 
> > > For those who have multiple nodes (and most likely many many
> > > memory cgroups) it provides way too many data except for debugging
> > > some weird mm issues.
> > > I guess in the absolute majority of cases having global per-node + per-memcg
> > > counters will be enough.
> > 
> > Hm? Reclaim uses the lruvec counters.
> 
> Can you, please, provide some examples? It looks like it's mostly based
> on per-zone lruvec size counters.

It uses the recursive lruvec state to decide inactive_is_low(),
whether refaults are occurring, whether to trim cache only or go for
anon etc. We use it to determine refault distances and how many shadow
nodes to shrink.

Grep for lruvec_page_state().
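
For example, one such read (a sketch; the exact call sites may differ):

	/* reclaim heuristics consume the recursive lruvec state, e.g. */
	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);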

> Anyway, it seems to be a little bit off from this patchset, so let's
> discuss it separately.

True


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 22:17       ` Johannes Weiner
  2020-02-03 22:38         ` Roman Gushchin
@ 2020-02-04  1:15         ` Roman Gushchin
  2020-02-04  2:47           ` Johannes Weiner
  1 sibling, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-04  1:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are a few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.

It's actually a questionable statement: we do skip allocations from certain
contexts, and we do merge slab caches.

Most likely it's true for certain slab_caches and not true for others.
Think of kmalloc-* caches.

Also, because obj_cgroup vectors are only freed together with their
underlying slab pages, the percentage of pages with obj_cgroups will most
likely grow with uptime. In other words, memcg allocations will fragment
root slab pages.

> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.

Quite the opposite: the patchset removes all the complexity (or 90% of it),
because it makes the kmem_cache lifetime independent of any cgroup stuff.

Kmem_caches are created on demand on the first request (most likely during
the system start-up), and destroyed together with their root counterparts
(most likely never or on rmmod). First request means the first request
globally, not the first request from a given memcg.

Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
after creation just matches the lifetime of the root kmem caches.

The only reason to keep the async creation is that some kmem_caches
are created very early in the boot process, long before any cgroup
stuff is initialized.
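
A sketch of the resulting lookup, with memcg_params.memcg_cache standing
in for the single per-root memcg cache; the exact field and helper names
are assumptions:

	/* one memcg cache per root cache, created once, destroyed with the root */
	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
	if (unlikely(!memcg_cachep)) {
		memcg_schedule_kmem_cache_create(cachep);	/* async: early boot */
		return cachep;	/* fall back, unaccounted, this one time */
	}
	return memcg_cachep;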

> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

It's just not true. The whole point of having root and memcg sets is
to avoid looking up a NULL pointer in the obj_cgroup vector when
releasing a root object. In other words, it keeps the overhead for root
allocations at zero. IMHO it's an important thing, and calling it
*completely insane* isn't the best way to communicate.

> 
> From a maintainability POV, we cannot afford merging it in this form.

It sounds strange: the patchset eliminates 90% of the complexity,
but it's unmergeable because there are 10% left.

I agree that it's an arguable question if we can tolerate some
additional overhead on root allocations to eliminate these additional
10%, but I really don't think it's so obvious that even discussing
it is insane.

Btw, there is another good idea to explore (also suggested by Christopher
Lameter): we can put memcg/objcg pointer into the slab page, avoiding
an extra allocation.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 22:39           ` Johannes Weiner
@ 2020-02-04  1:44             ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-02-04  1:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:39:54PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > > > It works fine because the overfill threshold can't exceed 125.
> > > > > > 
> > > > > > But if some counters are in bytes (and the next commit in the series
> > > > > > will convert slab counters to bytes), it's not gonna work:
> > > > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > > > vmstats correctness, let's use s32 instead.
> > > > > > 
> > > > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > > > zone-level byte-sized counters, so no reasons to change anything.
> > > > > 
> > > > > Wait, is this still necessary? AFAIU, the node counters will account
> > > > > full slab pages, including free space, and only the memcg counters
> > > > > that track actual objects will be in bytes.
> > > > > 
> > > > > Can you please elaborate?
> > > > 
> > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > > > being in different units depending on the accounting scope.
> > > > So I do convert all slab counters: global, per-lruvec,
> > > > and per-memcg to bytes.
> > > 
> > > Since the node counters tracks allocated slab pages and the memcg
> > > counter tracks allocated objects, arguably they shouldn't use the same
> > > name anyway.
> > > 
> > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > > > NR_SLAB_RECLAIMABLE_OBJ
> > > > NR_SLAB_UNRECLAIMABLE_OBJ
> > > 
> > > Can we alias them and reuse their slots?
> > > 
> > > 	/* Reuse the node slab page counters item for charged objects */
> > > 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> > > 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
> > 
> > Yeah, lgtm.
> > 
> > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
> > the enum memcg_stat_item?
> 
> Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps?

Maybe not. I'll probably go with 
    MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE,
    MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE,

Please, let me know if you're not ok with it.
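
For clarity, a stand-alone sketch of how that aliasing would look; the
node_stat_item entries are stubbed out here, so this is only an
illustration of the slot reuse, not the kernel's actual enum layout:

    /* stand-ins for the real node_stat_item entries */
    enum node_stat_item {
        NR_SLAB_RECLAIMABLE,
        NR_SLAB_UNRECLAIMABLE,
        NR_VM_NODE_STAT_ITEMS
    };

    /* memcg-level items reuse the node slots for charged objects */
    enum memcg_stat_item {
        MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE,
        MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE
        /* other, non-aliased memcg-only items would follow */
    };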

> 
> > > > and keep global counters untouched. If going this way, I'd prefer to make
> > > > them per-memcg, because it will simplify things on charging paths:
> > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > > > bump per-lruvec counters.
> > > 
> > > I don't quite follow. Don't you still have to update the global
> > > counters?
> > 
> > Global counters are updated only if an allocation requires a new slab
> > page, which isn't the most common path.
> 
> Right.
> 
> > In generic case post_hook is required because it's the only place where
> > we have both page (to get the node) and memcg pointer.
> > 
> > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
> > then post_hook can handle only the rare "allocation failed" case.
> > 
> > I'm not sure here what's better.
> 
> If it's tracked only per-memcg, you still have to account it every
> time you charge an object to a memcg, no? How is it less frequent than
> acconting at the lruvec level?

It's not less frequent, it's just that it can be done in the pre-alloc hook,
where a memcg pointer is already available.

The problem with the obj_cgroup approach is that we get it indirectly from
the current memcg in the pre_alloc_hook and pass it to the obj_cgroup API;
internally we might need to get the memcg back from it to charge a page,
and then again in the post_hook we need the memcg to bump the per-lruvec
stats. In other words, we make several memcg <-> objcg conversions, which
isn't very nice on the hot path.

I can see that in the future we might optimize the initial lookup of the
objcg, but getting the memcg just to bump vmstats looks unnecessarily
expensive. One option I'm thinking about is handling byte-sized stats at
the obj_cgroup level and flushing whole pages to the memcg level.
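
A minimal sketch of that last idea, assuming a per-objcg byte counter
that only propagates whole pages up to the memcg level (the names and
the flushing policy here are hypothetical):

    #include <stddef.h>

    #define PAGE_SIZE 4096

    struct objcg_stock {
        size_t nr_bytes;        /* sub-page remainder kept at objcg level */
        long *memcg_pages;      /* page-granular counter at memcg level */
    };

    static void objcg_account_bytes(struct objcg_stock *stock, size_t bytes)
    {
        stock->nr_bytes += bytes;
        if (stock->nr_bytes >= PAGE_SIZE) {
            *stock->memcg_pages += stock->nr_bytes / PAGE_SIZE;
            stock->nr_bytes %= PAGE_SIZE;   /* remember the leftover */
        }
    }

This way the expensive objcg -> memcg hop is only needed when a whole
page's worth of bytes has accumulated, not on every allocation.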

> 
> > > > Btw, I wonder if we really need per-lruvec counters at all (at least
> > > > being enabled by default). For the significant amount of users who
> > > > have a single-node machine it doesn't bring anything except performance
> > > > overhead.
> > > 
> > > Yeah, for single-node systems we should be able to redirect everything
> > > to the memcg counters, without allocating and tracking lruvec copies.
> > 
> > Sounds good. It can lead to significant savings on single-node machines.
> > 
> > > 
> > > > For those who have multiple nodes (and most likely many many
> > > > memory cgroups) it provides way too many data except for debugging
> > > > some weird mm issues.
> > > > I guess in the absolute majority of cases having global per-node + per-memcg
> > > > counters will be enough.
> > > 
> > > Hm? Reclaim uses the lruvec counters.
> > 
> > Can you, please, provide some examples? It looks like it's mostly based
> > on per-zone lruvec size counters.
> 
> It uses the recursive lruvec state to decide inactive_is_low(),
> whether refaults are occuring, whether to trim cache only or go for
> anon etc. We use it to determine refault distances and how many shadow
> nodes to shrink.
> 
> Grep for lruvec_page_state().

I see... Thanks!

> 
> > Anyway, it seems to be a little bit off from this patchset, so let's
> > discuss it separately.
> 
> True


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  1:15         ` Roman Gushchin
@ 2020-02-04  2:47           ` Johannes Weiner
  2020-02-04  4:35             ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-04  2:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > slab allocations use a single set of kmem_caches instead of
> > > > > creating a separate set for each memory cgroup.
> > > > > 
> > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > prematurely. They can be perfectly destroyed together with their
> > > > > root counterparts. This allows to dramatically simplify the
> > > > > management of non-root kmem_caches and delete a ton of code.
> > > > 
> > > > This is definitely going in the right direction. But it doesn't quite
> > > > explain why we still need two sets of kmem_caches?
> > > > 
> > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > 
> > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > allocations go through a single set of kmem caches? If an allocation
> > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > 
> > > Well, arguably it can be done, but there are few drawbacks:
> > > 
> > > 1) On the release path you'll need to make some extra work even for
> > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > 
> > > 2) There will be a memory overhead for root allocations
> > >    (which might or might not be compensated by the increase
> > >    of the slab utilization).
> > 
> > Those two are only true if there is a wild mix of root and cgroup
> > allocations inside the same slab, and that doesn't really happen in
> > practice. Either the machine is dedicated to one workload and cgroups
> > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > systems (like most distro systems now) that cgroup everything.
> 
> It's actually a questionable statement: we do skip allocations from certain
> contexts, and we do merge slab caches.
> 
> Most likely it's true for certain slab_caches and not true for others.
> Think of kmalloc-* caches.

With merging it's actually really hard to say how sparse or dense the
resulting objcgroup arrays would be. It could change all the time too.

> Also, because obj_cgroup vectors will not be freed without underlying pages,
> most likely the percentage of pages with obj_cgroups will grow with uptime.
> In other words, memcg allocations will fragment root slab pages.

I understand the first part of this paragraph, but not the second. The
objcgroup vectors will be freed when the slab pages get freed. But the
partially filled slab pages can be reused by any type of allocation,
surely? How would this cause the pages to fragment?

> > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > >    except that obj_cgroups vector is created for the whole percpu block.
> > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > >    I kinda like using the same scheme here and there.
> > 
> > It's hard to conclude anything based on this information alone. If
> > it's truly expensive to merge them, then it warrants the additional
> > complexity. But I don't understand the desire to share a design for
> > two systems with sufficiently different constraints.
> > 
> > > Upsides?
> > > 
> > > 1) slab utilization might increase a little bit (but I doubt it will have
> > >    a huge effect, because both merging sets should be relatively big and well
> > >    utilized)
> > 
> > Right.
> > 
> > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > >    but there isn't so much code left anyway.
> > 
> > There is a lot of complexity associated with the cache cloning that
> > isn't the lines of code, but the lifetime and synchronization rules.
> 
> Quite opposite: the patchset removes all the complexity (or 90% of it),
> because it makes the kmem_cache lifetime independent from any cgroup stuff.
> 
> Kmem_caches are created on demand on the first request (most likely during
> the system start-up), and destroyed together with their root counterparts
> (most likely never or on rmmod). First request means globally first request,
> not a first request from a given memcg.
> 
> Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> after creation just matches the lifetime of the root kmem caches.
> 
> The only reason to keep the async creation is that some kmem_caches
> are created very early in the boot process, long before any cgroup
> stuff is initialized.

Yes, it's independent of the obj_cgroup and memcg, and yes it's
simpler after your patches. But I'm not talking about the delta, I'm
trying to understand the end result.

And the truth is there is a decent chunk of code and tentacles spread
throughout the slab/cgroup code to clone, destroy, and handle the
split caches, as well as the branches/indirections on every cgrouped
slab allocation.

Yet there is no good explanation for why things are done this way
anywhere in the changelog, the cover letter, or the code. And it's
hard to get a satisfying answer even to direct questions about it.

Forget about how anything was before your patches and put yourself
into the shoes of somebody who comes at the new code without any
previous knowledge. "It was even worse before" just isn't a satisfying
answer.

> > And these two things are the primary aspects that make my head hurt
> > trying to review this patch series.
> > 
> > > So IMO it's an interesting direction to explore, but not something
> > > that necessarily has to be done in the context of this patchset.
> > 
> > I disagree. Instead of replacing the old coherent model and its
> > complexities with a new coherent one, you are mixing the two. And I
> > can barely understand the end result.
> > 
> > Dynamically cloning entire slab caches for the sole purpose of telling
> > whether the pages have an obj_cgroup array or not is *completely
> > insane*. If the controller had followed the obj_cgroup design from the
> > start, nobody would have ever thought about doing it like this.
> 
> It's just not true. The whole point of having root- and memcg sets is
> to be able to not look for a NULL pointer in the obj_cgroup vector on
> releasing of the root object. In other words, it allows to keep zero
> overhead for root allocations. IMHO it's an important thing, and calling
> it *completely insane* isn't the best way to communicate.

But you're trading it for the indirection of going through a separate
kmem_cache for every single cgroup-accounted allocation. Why is this a
preferable trade-off to make?

I'm asking basic questions about your design choices. It's not okay to
dismiss this with "it's an interesting direction to explore outside
the context this patchset".

> > From a maintainability POV, we cannot afford merging it in this form.
> 
> It sounds strange: the patchset eliminates 90% of the complexity,
> but it's unmergeable because there are 10% left.

No, it's unmergeable if you're unwilling to explain and document your
design choices when somebody who is taking the time and effort to look
at your patches doesn't understand why things are the way they are.

We are talking about 1500 lines of complicated core kernel code. They
*have* to make sense to people other than you if we want to have this
upstream.

> I agree that it's an arguable question if we can tolerate some
> additional overhead on root allocations to eliminate these additional
> 10%, but I really don't think it's so obvious that even discussing
> it is insane.

Well that's exactly my point.

> Btw, there is another good idea to explore (also suggested by Christopher
> Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> an extra allocation.

I agree with this idea, but I do think that's a bit more obviously in
optimization territory. The objcg is much larger than a pointer to it,
and it wouldn't significantly change the alloc/free sequence, right?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  2:47           ` Johannes Weiner
@ 2020-02-04  4:35             ` Roman Gushchin
  2020-02-04 18:41               ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-02-04  4:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > creating a separate set for each memory cgroup.
> > > > > > 
> > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > 
> > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > explain why we still need two sets of kmem_caches?
> > > > > 
> > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > 
> > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > 
> > > > Well, arguably it can be done, but there are few drawbacks:
> > > > 
> > > > 1) On the release path you'll need to make some extra work even for
> > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > 
> > > > 2) There will be a memory overhead for root allocations
> > > >    (which might or might not be compensated by the increase
> > > >    of the slab utilization).
> > > 
> > > Those two are only true if there is a wild mix of root and cgroup
> > > allocations inside the same slab, and that doesn't really happen in
> > > practice. Either the machine is dedicated to one workload and cgroups
> > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > systems (like most distro systems now) that cgroup everything.
> > 
> > It's actually a questionable statement: we do skip allocations from certain
> > contexts, and we do merge slab caches.
> > 
> > Most likely it's true for certain slab_caches and not true for others.
> > Think of kmalloc-* caches.
> 
> With merging it's actually really hard to say how sparse or dense the
> resulting objcgroup arrays would be. It could change all the time too.

So here is some actual data from my dev machine. The first column is the number
of pages in the root cache, the second is the number of pages in the
corresponding memcg cache.

   ext4_groupinfo_4k          1          0
     rpc_inode_cache          1          0
        fuse_request         62          0
          fuse_inode          1       2732
  btrfs_delayed_node       1192          0
btrfs_ordered_extent        129          0
    btrfs_extent_map       8686          0
 btrfs_extent_buffer       2648          0
         btrfs_inode         12       6739
              PINGv6          1         11
               RAWv6          2          5
               UDPv6          1         34
       tw_sock_TCPv6        378          3
  request_sock_TCPv6         24          0
               TCPv6         46         74
  mqueue_inode_cache          1          0
 jbd2_journal_handle          2          0
   jbd2_journal_head          2          0
 jbd2_revoke_table_s          1          0
    ext4_inode_cache          1          3
ext4_allocation_context          1          0
         ext4_io_end          1          0
  ext4_extent_status          5          0
             mbcache          1          0
      dnotify_struct          1          0
  posix_timers_cache         24          0
      xfrm_dst_cache        202          0
                 RAW          3         12
                 UDP          2         24
         tw_sock_TCP         25          0
    request_sock_TCP         24          0
                 TCP          7         24
hugetlbfs_inode_cache          2          0
               dquot          2          0
       eventpoll_pwq          1        119
           dax_cache          1          0
       request_queue          9          0
          blkdev_ioc        241          0
          biovec-max        112          0
          biovec-128          2          0
           biovec-64          6          0
  khugepaged_mm_slot        248          0
 dmaengine-unmap-256          1          0
 dmaengine-unmap-128          1          0
  dmaengine-unmap-16         39          0
    sock_inode_cache          9        219
    skbuff_ext_cache        249          0
 skbuff_fclone_cache         83          0
   skbuff_head_cache        138        141
     file_lock_cache         24          0
       net_namespace          1          5
   shmem_inode_cache         14         56
     task_delay_info         23        165
           taskstats         24          0
      proc_dir_entry         24          0
          pde_opener         16         24
    proc_inode_cache         24       1103
          bdev_cache          4         20
   kernfs_node_cache       1405          0
           mnt_cache         54          0
                filp         53        460
         inode_cache        488       2287
              dentry        367      10576
         names_cache         24          0
        ebitmap_node          2          0
     avc_xperms_data        256          0
      lsm_file_cache         92          0
         buffer_head         24          9
       uts_namespace          1          3
      vm_area_struct         48        810
           mm_struct         19         29
         files_cache         14         26
        signal_cache         28        143
       sighand_cache         45         47
         task_struct         77        430
            cred_jar         29        424
      anon_vma_chain         39        492
            anon_vma         28        467
                 pid         30        369
        Acpi-Operand         56          0
          Acpi-Parse       5587          0
          Acpi-State       4137          0
      Acpi-Namespace          8          0
         numa_policy        137          0
  ftrace_event_field         68          0
      pool_workqueue         25          0
     radix_tree_node       1694       7776
          task_group         21          0
           vmap_area        477          0
     kmalloc-rcl-512        473          0
     kmalloc-rcl-256        605          0
     kmalloc-rcl-192         43         16
     kmalloc-rcl-128          1         47
      kmalloc-rcl-96          3        229
      kmalloc-rcl-64          6        611
          kmalloc-8k         48         24
          kmalloc-4k        372         59
          kmalloc-2k        132         50
          kmalloc-1k        251         82
         kmalloc-512        360        150
         kmalloc-256        237          0
         kmalloc-192        298         24
         kmalloc-128        203         24
          kmalloc-96        112         24
          kmalloc-64        796         24
          kmalloc-32       1188         26
          kmalloc-16        555         25
           kmalloc-8         42         24
     kmem_cache_node         20          0
          kmem_cache         24          0

> 
> > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > In other words, memcg allocations will fragment root slab pages.
> 
> I understand the first part of this paragraph, but not the second. The
> objcgroup vectors will be freed when the slab pages get freed. But the
> partially filled slab pages can be reused by any types of allocations,
> surely? How would this cause the pages to fragment?

I mean the following: once you allocate a single accounted object from
the page, the obj_cgroup vector is allocated and will be released only
together with the slab page. We really, really don't want to count how many
accounted objects are on the page and release the obj_cgroup vector once
that count reaches 0. So even if all subsequent allocations are root
allocations, the overhead will not go away over the uptime.

In other words, even a small percentage of accounted objects will turn
the whole cache into an "accountable" one.
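
Roughly, the behavior I'm describing looks like this (a userspace-style
sketch with hypothetical names, just to spell out the lifetime):

    #include <stdlib.h>

    struct obj_cgroup;

    struct slab_page {
        unsigned int objects;           /* objects per slab page */
        struct obj_cgroup **objcgs;     /* NULL until first accounted alloc */
    };

    /* called only for accounted allocations */
    static int slab_page_account(struct slab_page *page, unsigned int idx,
                                 struct obj_cgroup *objcg)
    {
        if (!page->objcgs) {
            /* the first accounted object creates the vector... */
            page->objcgs = calloc(page->objects, sizeof(*page->objcgs));
            if (!page->objcgs)
                return -1;
        }
        page->objcgs[idx] = objcg;      /* root objects simply stay NULL */
        return 0;
    }

    /* ...and only the page teardown path releases it again */
    static void slab_page_free(struct slab_page *page)
    {
        free(page->objcgs);
        page->objcgs = NULL;
    }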

> 
> > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > >    I kinda like using the same scheme here and there.
> > > 
> > > It's hard to conclude anything based on this information alone. If
> > > it's truly expensive to merge them, then it warrants the additional
> > > complexity. But I don't understand the desire to share a design for
> > > two systems with sufficiently different constraints.
> > > 
> > > > Upsides?
> > > > 
> > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > >    a huge effect, because both merging sets should be relatively big and well
> > > >    utilized)
> > > 
> > > Right.
> > > 
> > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > >    but there isn't so much code left anyway.
> > > 
> > > There is a lot of complexity associated with the cache cloning that
> > > isn't the lines of code, but the lifetime and synchronization rules.
> > 
> > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > 
> > Kmem_caches are created on demand on the first request (most likely during
> > the system start-up), and destroyed together with their root counterparts
> > (most likely never or on rmmod). First request means globally first request,
> > not a first request from a given memcg.
> > 
> > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > after creation just matches the lifetime of the root kmem caches.
> > 
> > The only reason to keep the async creation is that some kmem_caches
> > are created very early in the boot process, long before any cgroup
> > stuff is initialized.
> 
> Yes, it's independent of the obj_cgroup and memcg, and yes it's
> simpler after your patches. But I'm not talking about the delta, I'm
> trying to understand the end result.
> 
> And the truth is there is a decent chunk of code and tentacles spread
> throughout the slab/cgroup code to clone, destroy, and handle the
> split caches, as well as the branches/indirections on every cgrouped
> slab allocation.

Did you see the final code? It's fairly simple and there is really not
much complexity left. If you don't think so, let's go into the details,
because otherwise it's hard to say anything.

With such a change, which basically removes the current implementation
and replaces it with a new one, it's hard to keep the balance between
making commits small and self-contained and showing the whole picture.
I'm fully open to questions and generally want to make it simpler.

I've tried to separate out some parts and get them merged before the main
thing, but they haven't been merged yet, so I have to include them
to keep the series building.

Would a more detailed design description in the cover letter help?
Would writing a design doc to put into Documentation/ help?
Is it better to rearrange the patches so that the current implementation
is eliminated first and the new one is built from scratch?

> 
> Yet there is no good explanation for why things are done this way
> anywhere in the changelog, the cover letter, or the code. And it's
> hard to get a satisfying answer even to direct questions about it.

I do not agree. I try to answer all questions. But I also expect
that my arguments will be listened to.
(I didn't answer the questions regarding the obj_cgroup lifetime, but only
because I need some more time to think. If that wasn't clear, I'm sorry.)

> 
> Forget about how anything was before your patches and put yourself
> into the shoes of somebody who comes at the new code without any
> previous knowledge. "It was even worse before" just isn't a satisfying
> answer.

Absolutely agree.

But at the same time, "now it's better than before" sounds like a valid
justification for a change. The code is never perfect.

But please, let's not go into long discussions here, and save some time.

> 
> > > And these two things are the primary aspects that make my head hurt
> > > trying to review this patch series.
> > > 
> > > > So IMO it's an interesting direction to explore, but not something
> > > > that necessarily has to be done in the context of this patchset.
> > > 
> > > I disagree. Instead of replacing the old coherent model and its
> > > complexities with a new coherent one, you are mixing the two. And I
> > > can barely understand the end result.
> > > 
> > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > whether the pages have an obj_cgroup array or not is *completely
> > > insane*. If the controller had followed the obj_cgroup design from the
> > > start, nobody would have ever thought about doing it like this.
> > 
> > It's just not true. The whole point of having root- and memcg sets is
> > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > releasing of the root object. In other words, it allows to keep zero
> > overhead for root allocations. IMHO it's an important thing, and calling
> > it *completely insane* isn't the best way to communicate.
> 
> But you're trading it for the indirection of going through a separate
> kmem_cache for every single cgroup-accounted allocation. Why is this a
> preferable trade-off to make?

Because it allows us to keep zero memory and CPU overhead for root allocations.
I have no data showing that this overhead would be small and acceptable in
all cases. I think keeping zero overhead for root allocations is more
important than having a single set of kmem_caches.

> 
> I'm asking basic questions about your design choices. It's not okay to
> dismiss this with "it's an interesting direction to explore outside
> the context this patchset".

I'm not dismissing any questions.
There is a difference between a question and a must-follow suggestion
that comes with known but ignored trade-offs.

> 
> > > From a maintainability POV, we cannot afford merging it in this form.
> > 
> > It sounds strange: the patchset eliminates 90% of the complexity,
> > but it's unmergeable because there are 10% left.
> 
> No, it's unmergeable if you're unwilling to explain and document your
> design choices when somebody who is taking the time and effort to look
> at your patches doesn't understand why things are the way they are.

I'm not unwilling to explain. Otherwise I just wouldn't post it upstream,
right? And I assume you're not spending your time reviewing it with the
goal of keeping the current code intact.

Please, let's keep things which are hard to understand and require an
explanation separate from things which you think are better done differently.

Both are valid and appreciated comments, but mixing them isn't productive.

> 
> We are talking about 1500 lines of complicated core kernel code. They
> *have* to make sense to people other than you if we want to have this
> upstream.

Right.

> 
> > I agree that it's an arguable question if we can tolerate some
> > additional overhead on root allocations to eliminate these additional
> > 10%, but I really don't think it's so obvious that even discussing
> > it is insane.
> 
> Well that's exactly my point.

Ok, what's the acceptable performance penalty?
Is adding 20% on the free path acceptable, for example?
Or adding 3% of slab memory?

> 
> > Btw, there is another good idea to explore (also suggested by Christopher
> > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > an extra allocation.
> 
> I agree with this idea, but I do think that's a bit more obviously in
> optimization territory. The objcg is much larger than a pointer to it,
> and it wouldn't significantly change the alloc/free sequence, right?

So the idea is that putting the obj_cgroup pointer nearby will eliminate
some cache misses. But then it's preferable to have two sets, because
otherwise there is a memory overhead from allocating extra space for the
objcg pointer.
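
Purely as an illustration of that trade-off (this is an assumption about
how the pointer could be placed, not what SLAB/SLUB does today): if the
objcg pointer sits right next to each object, the free path touches
memory that is already warm, but every object pays for the pointer,
accounted or not.

    struct obj_cgroup;

    /* hypothetical per-object slot layout with a co-located pointer */
    struct slab_slot {
        unsigned char object[64];       /* example object size */
        struct obj_cgroup *objcg;       /* extra pointer for every object */
    };

That per-object cost for unaccounted objects is exactly why two cache
sets still look attractive in such a scheme.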


Stepping back a bit: the new scheme (the new slab controller) adds some CPU
operations on the allocation and release paths. It's unavoidable: more precise
accounting requires more CPU. But IMO it's worth it, because it leads
to significant memory savings and reduced memory fragmentation.
It also reduces the code complexity (which is a bonus, but not the primary goal).

So far I haven't seen any workloads where the difference was noticeable,
but that doesn't mean they don't exist. That's why I'm very concerned about
any suggestions which might, even in theory, increase the CPU overhead.
Keeping it at zero for root allocations makes it possible to exclude
something from the accounting if the performance penalty isn't tolerable.

Thanks!


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  4:35             ` Roman Gushchin
@ 2020-02-04 18:41               ` Johannes Weiner
  2020-02-05 15:58                 ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2020-02-04 18:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > creating a separate set for each memory cgroup.
> > > > > > > 
> > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > 
> > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > explain why we still need two sets of kmem_caches?
> > > > > > 
> > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > 
> > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > 
> > > > > Well, arguably it can be done, but there are few drawbacks:
> > > > > 
> > > > > 1) On the release path you'll need to make some extra work even for
> > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > 
> > > > > 2) There will be a memory overhead for root allocations
> > > > >    (which might or might not be compensated by the increase
> > > > >    of the slab utilization).
> > > > 
> > > > Those two are only true if there is a wild mix of root and cgroup
> > > > allocations inside the same slab, and that doesn't really happen in
> > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > systems (like most distro systems now) that cgroup everything.
> > > 
> > > It's actually a questionable statement: we do skip allocations from certain
> > > contexts, and we do merge slab caches.
> > > 
> > > Most likely it's true for certain slab_caches and not true for others.
> > > Think of kmalloc-* caches.
> > 
> > With merging it's actually really hard to say how sparse or dense the
> > resulting objcgroup arrays would be. It could change all the time too.
> 
> So here is some actual data from my dev machine. The first column is the number
> of pages in the root cache, the second - in the corresponding memcg.
> 
>    ext4_groupinfo_4k          1          0
>      rpc_inode_cache          1          0
>         fuse_request         62          0
>           fuse_inode          1       2732
>   btrfs_delayed_node       1192          0
> btrfs_ordered_extent        129          0
>     btrfs_extent_map       8686          0
>  btrfs_extent_buffer       2648          0
>          btrfs_inode         12       6739
>               PINGv6          1         11
>                RAWv6          2          5
>                UDPv6          1         34
>        tw_sock_TCPv6        378          3
>   request_sock_TCPv6         24          0
>                TCPv6         46         74
>   mqueue_inode_cache          1          0
>  jbd2_journal_handle          2          0
>    jbd2_journal_head          2          0
>  jbd2_revoke_table_s          1          0
>     ext4_inode_cache          1          3
> ext4_allocation_context          1          0
>          ext4_io_end          1          0
>   ext4_extent_status          5          0
>              mbcache          1          0
>       dnotify_struct          1          0
>   posix_timers_cache         24          0
>       xfrm_dst_cache        202          0
>                  RAW          3         12
>                  UDP          2         24
>          tw_sock_TCP         25          0
>     request_sock_TCP         24          0
>                  TCP          7         24
> hugetlbfs_inode_cache          2          0
>                dquot          2          0
>        eventpoll_pwq          1        119
>            dax_cache          1          0
>        request_queue          9          0
>           blkdev_ioc        241          0
>           biovec-max        112          0
>           biovec-128          2          0
>            biovec-64          6          0
>   khugepaged_mm_slot        248          0
>  dmaengine-unmap-256          1          0
>  dmaengine-unmap-128          1          0
>   dmaengine-unmap-16         39          0
>     sock_inode_cache          9        219
>     skbuff_ext_cache        249          0
>  skbuff_fclone_cache         83          0
>    skbuff_head_cache        138        141
>      file_lock_cache         24          0
>        net_namespace          1          5
>    shmem_inode_cache         14         56
>      task_delay_info         23        165
>            taskstats         24          0
>       proc_dir_entry         24          0
>           pde_opener         16         24
>     proc_inode_cache         24       1103
>           bdev_cache          4         20
>    kernfs_node_cache       1405          0
>            mnt_cache         54          0
>                 filp         53        460
>          inode_cache        488       2287
>               dentry        367      10576
>          names_cache         24          0
>         ebitmap_node          2          0
>      avc_xperms_data        256          0
>       lsm_file_cache         92          0
>          buffer_head         24          9
>        uts_namespace          1          3
>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369
>         Acpi-Operand         56          0
>           Acpi-Parse       5587          0
>           Acpi-State       4137          0
>       Acpi-Namespace          8          0
>          numa_policy        137          0
>   ftrace_event_field         68          0
>       pool_workqueue         25          0
>      radix_tree_node       1694       7776
>           task_group         21          0
>            vmap_area        477          0
>      kmalloc-rcl-512        473          0
>      kmalloc-rcl-256        605          0
>      kmalloc-rcl-192         43         16
>      kmalloc-rcl-128          1         47
>       kmalloc-rcl-96          3        229
>       kmalloc-rcl-64          6        611
>           kmalloc-8k         48         24
>           kmalloc-4k        372         59
>           kmalloc-2k        132         50
>           kmalloc-1k        251         82
>          kmalloc-512        360        150
>          kmalloc-256        237          0
>          kmalloc-192        298         24
>          kmalloc-128        203         24
>           kmalloc-96        112         24
>           kmalloc-64        796         24
>           kmalloc-32       1188         26
>           kmalloc-16        555         25
>            kmalloc-8         42         24
>      kmem_cache_node         20          0
>           kmem_cache         24          0

That's interesting, thanks. It does look fairly bimodal, except in
some smaller caches. Which does make sense when you think about it: we
focus on accounting consumers that are driven by userspace activity
and big enough to actually matter in terms of cgroup footprint.

> > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > In other words, memcg allocations will fragment root slab pages.
> > 
> > I understand the first part of this paragraph, but not the second. The
> > objcgroup vectors will be freed when the slab pages get freed. But the
> > partially filled slab pages can be reused by any types of allocations,
> > surely? How would this cause the pages to fragment?
> 
> I mean the following: once you allocate a single accounted object
> from the page, obj_cgroup vector is allocated and will be released only
> with the slab page. We really really don't want to count how many accounted
> objects are on the page and release obj_cgroup vector on reaching 0.
> So even if all following allocations are root allocations, the overhead
> will not go away with the uptime.
> 
> In other words, even a small percentage of accounted objects will
> turn the whole cache into "accountable".

Correct. The worst case is where we have a large cache that has N
objects per slab, but only ~1/N objects are accounted to a cgroup.

The question is whether this is common or even realistic. When would a
cache be big, but only a small subset of its allocations would be
attributable to specific cgroups?

On the less extreme overlapping cases, yeah there are fragmented
obj_cgroup arrays, but there is also better slab packing. One is an
array of pointers, the other is an array of larger objects. It would
seem slab fragmentation has the potential to waste much more memory?
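
To put rough numbers on it (a hypothetical example, not measured data):
with 64-byte objects, a 4k slab holds 64 objects, so the obj_cgroup
array is 64 * 8 = 512 bytes, an eighth of the slab's size, on top of the
page, even if only a single object on it is accounted. But segregating
that one accounted object into its own per-cgroup slab would leave the
other 63 slots, most of a page, unused.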

> > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > >    I kinda like using the same scheme here and there.
> > > > 
> > > > It's hard to conclude anything based on this information alone. If
> > > > it's truly expensive to merge them, then it warrants the additional
> > > > complexity. But I don't understand the desire to share a design for
> > > > two systems with sufficiently different constraints.
> > > > 
> > > > > Upsides?
> > > > > 
> > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > >    utilized)
> > > > 
> > > > Right.
> > > > 
> > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > >    but there isn't so much code left anyway.
> > > > 
> > > > There is a lot of complexity associated with the cache cloning that
> > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > 
> > > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > 
> > > Kmem_caches are created on demand on the first request (most likely during
> > > the system start-up), and destroyed together with their root counterparts
> > > (most likely never or on rmmod). First request means globally first request,
> > > not a first request from a given memcg.
> > > 
> > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > after creation just matches the lifetime of the root kmem caches.
> > > 
> > > The only reason to keep the async creation is that some kmem_caches
> > > are created very early in the boot process, long before any cgroup
> > > stuff is initialized.
> > 
> > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > simpler after your patches. But I'm not talking about the delta, I'm
> > trying to understand the end result.
> > 
> > And the truth is there is a decent chunk of code and tentacles spread
> > throughout the slab/cgroup code to clone, destroy, and handle the
> > split caches, as well as the branches/indirections on every cgrouped
> > slab allocation.
> 
> Did you see the final code? It's fairly simple and there is really not
> much of complexity left. If you don't think so, let's go into details,
> because otherwise it's hard to say anything.

I have the patches applied to a local tree and am looking at the final
code. But I can only repeat that "it's not too bad" simply isn't a
good explanation for why the code is the way it is.

> With a such change which basically removes the current implementation
> and replaces it with a new one, it's hard to keep the balance between
> making commits self-contained and small, but also showing the whole picture.
> I'm fully open to questions and generally want to make it simpler.
> 
> I've tried to separate some parts and get them merged before the main
> thing, but they haven't been merged yet, so I have to include them
> to keep the thing building.
> 
> Will a more-detailed design in the cover help?
> Will writing a design doc to put into Documentation/ help?
> Is it better to rearrange patches in a way to eliminate the current
> implementation first and build from scratch?

It would help to have changelogs that actually describe how the new
design is supposed to work, and why you made the decisions you made.

The changelog in this patch here sells the change as a reduction in
complexity, without explaining why it stopped where it stopped. So
naturally, if that's the declared goal, the first question is whether
we can make it simpler.

Both the cover letter and the changelogs should focus less on what was
there and how it was deleted, and more on how the code is supposed to
work after the patches. How the components were designed and how they
all work together.

As I said before, imagine somebody without any historical knowledge
reading the code. They should be able to find out why you chose to
have two sets of kmem caches. There is no explanation for it other
than "there used to be more, so we cut it down to two".

> > > > And these two things are the primary aspects that make my head hurt
> > > > trying to review this patch series.
> > > > 
> > > > > So IMO it's an interesting direction to explore, but not something
> > > > > that necessarily has to be done in the context of this patchset.
> > > > 
> > > > I disagree. Instead of replacing the old coherent model and its
> > > > complexities with a new coherent one, you are mixing the two. And I
> > > > can barely understand the end result.
> > > > 
> > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > whether the pages have an obj_cgroup array or not is *completely
> > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > start, nobody would have ever thought about doing it like this.
> > > 
> > > It's just not true. The whole point of having root- and memcg sets is
> > > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > > releasing of the root object. In other words, it allows to keep zero
> > > overhead for root allocations. IMHO it's an important thing, and calling
> > > it *completely insane* isn't the best way to communicate.
> > 
> > But you're trading it for the indirection of going through a separate
> > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > preferable trade-off to make?
> 
> Because it allows to keep zero memory and cpu overhead for root allocations.
> I've no data showing that this overhead is small and acceptable in all cases.
> I think keeping zero overhead for root allocations is more important
> than having a single set of kmem caches.

In the kmem cache breakdown you provided above, there are 35887 pages
allocated to root caches and 37300 pages allocated to cgroup caches.

Why are root allocations supposed to be more important? Aren't some of
the hottest allocations tracked by cgroups? Look at fork():

>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369

Hard to think of much hotter allocations. They all have to suffer the
additional branch and cache footprint of the auxiliary cgroup caches.

> > > I agree that it's an arguable question if we can tolerate some
> > > additional overhead on root allocations to eliminate these additional
> > > 10%, but I really don't think it's so obvious that even discussing
> > > it is insane.
> > 
> > Well that's exactly my point.
> 
> Ok, what's the acceptable performance penalty?
> Is adding 20% on free path is acceptable, for example?
> Or adding 3% of slab memory?

I find offhand replies like these very jarring.

There is a legitimate design question: Why are you using a separate
set of kmem caches for the cgroup allocations, citing the additional
complexity over having one set? And your reply was mostly handwaving.

So: what's the overhead you're saving by having two sets? What is this
additional stuff buying us?

Pretend the split-cache infra hadn't been there. Would you have added
it? If so, based on what data? Now obviously, you didn't write it - it
was there because that's the way the per-cgroup accounting was done
previously. But you did choose to keep it. And it's a fair question
what (quantifiable) role it plays in your new way of doing things.

> > > Btw, there is another good idea to explore (also suggested by Christopher
> > > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > > an extra allocation.
> > 
> > I agree with this idea, but I do think that's a bit more obviously in
> > optimization territory. The objcg is much larger than a pointer to it,
> > and it wouldn't significantly change the alloc/free sequence, right?
> 
> So the idea is that putting the obj_cgroup pointer nearby will eliminate
> some cache misses. But then it's preferable to have two sets, because otherwise
> there is a memory overhead from allocating an extra space for the objcg pointer.

This trade-off is based on two assumptions:

1) Unaccounted allocations are more performance sensitive than
accounted allocations.

2) Fragmented obj_cgroup arrays waste more memory than fragmented
slabs.

You haven't sufficiently shown that either of those are true. (And I
suspect they are both false.)

So my stance is that until you make a more convincing argument for
this, a simpler concept and implementation, as well as balanced CPU
cost for unaccounted and accounted allocations, wins out.

> Stepping a bit back: the new scheme (new slab controller) adds some cpu operations
> on the allocation and release paths. It's unavoidable: more precise
> accounting requires more CPU. But IMO it's worth it because it leads
> to significant memory savings and reduced memory fragmentation.
> Also it reduces the code complexity (which is a bonus but not the primary goal).
> 
> I haven't seen so far any workloads where the difference was noticeable,
> but it doesn't mean they do not exist. That's why I'm very concerned about
> any suggestions which might even in theory increase the cpu overhead.
> Keeping it at zero level for root allocations allows do exclude
> something from the accounting if the performance penalty is not tolerable.

That sounds like a false trade-off to me. We account memory for
functional correctness - consumers that are big enough to
fundamentally alter the picture of cgroup memory footprints, allow
users to disturb other containers, or even cause host-level OOM
situations. Not based on whether they are cheap to track.

In fact, I would make the counter-argument. It'd be pretty bad if,
every time we had to make an accounting change to maintain functional
correctness, we had to worry about a CPU regression that exists in
part because we're trying to keep unaccounted allocations cheaper.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04 18:41               ` Johannes Weiner
@ 2020-02-05 15:58                 ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-02-05 15:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Tue, Feb 04, 2020 at 01:41:59PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > > creating a separate set for each memory cgroup.
> > > > > > > > 
> > > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > > 
> > > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > > explain why we still need two sets of kmem_caches?
> > > > > > > 
> > > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > > 
> > > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > > 
> > > > > > Well, arguably it can be done, but there are a few drawbacks:
> > > > > > 
> > > > > > 1) On the release path you'll need to do some extra work even for
> > > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > > 
> > > > > > 2) There will be a memory overhead for root allocations
> > > > > >    (which might or might not be compensated by the increase
> > > > > >    of the slab utilization).
> > > > > 
> > > > > Those two are only true if there is a wild mix of root and cgroup
> > > > > allocations inside the same slab, and that doesn't really happen in
> > > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > > systems (like most distro systems now) that cgroup everything.
> > > > 
> > > > It's actually a questionable statement: we do skip allocations from certain
> > > > contexts, and we do merge slab caches.
> > > > 
> > > > Most likely it's true for certain slab_caches and not true for others.
> > > > Think of kmalloc-* caches.
> > > 
> > > With merging it's actually really hard to say how sparse or dense the
> > > resulting objcgroup arrays would be. It could change all the time too.
> > 
> > So here is some actual data from my dev machine. The first column is the number
> > of pages in the root cache, the second - in the corresponding memcg.
> > 
> >    ext4_groupinfo_4k          1          0
> >      rpc_inode_cache          1          0
> >         fuse_request         62          0
> >           fuse_inode          1       2732
> >   btrfs_delayed_node       1192          0
> > btrfs_ordered_extent        129          0
> >     btrfs_extent_map       8686          0
> >  btrfs_extent_buffer       2648          0
> >          btrfs_inode         12       6739
> >               PINGv6          1         11
> >                RAWv6          2          5
> >                UDPv6          1         34
> >        tw_sock_TCPv6        378          3
> >   request_sock_TCPv6         24          0
> >                TCPv6         46         74
> >   mqueue_inode_cache          1          0
> >  jbd2_journal_handle          2          0
> >    jbd2_journal_head          2          0
> >  jbd2_revoke_table_s          1          0
> >     ext4_inode_cache          1          3
> > ext4_allocation_context          1          0
> >          ext4_io_end          1          0
> >   ext4_extent_status          5          0
> >              mbcache          1          0
> >       dnotify_struct          1          0
> >   posix_timers_cache         24          0
> >       xfrm_dst_cache        202          0
> >                  RAW          3         12
> >                  UDP          2         24
> >          tw_sock_TCP         25          0
> >     request_sock_TCP         24          0
> >                  TCP          7         24
> > hugetlbfs_inode_cache          2          0
> >                dquot          2          0
> >        eventpoll_pwq          1        119
> >            dax_cache          1          0
> >        request_queue          9          0
> >           blkdev_ioc        241          0
> >           biovec-max        112          0
> >           biovec-128          2          0
> >            biovec-64          6          0
> >   khugepaged_mm_slot        248          0
> >  dmaengine-unmap-256          1          0
> >  dmaengine-unmap-128          1          0
> >   dmaengine-unmap-16         39          0
> >     sock_inode_cache          9        219
> >     skbuff_ext_cache        249          0
> >  skbuff_fclone_cache         83          0
> >    skbuff_head_cache        138        141
> >      file_lock_cache         24          0
> >        net_namespace          1          5
> >    shmem_inode_cache         14         56
> >      task_delay_info         23        165
> >            taskstats         24          0
> >       proc_dir_entry         24          0
> >           pde_opener         16         24
> >     proc_inode_cache         24       1103
> >           bdev_cache          4         20
> >    kernfs_node_cache       1405          0
> >            mnt_cache         54          0
> >                 filp         53        460
> >          inode_cache        488       2287
> >               dentry        367      10576
> >          names_cache         24          0
> >         ebitmap_node          2          0
> >      avc_xperms_data        256          0
> >       lsm_file_cache         92          0
> >          buffer_head         24          9
> >        uts_namespace          1          3
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> >         Acpi-Operand         56          0
> >           Acpi-Parse       5587          0
> >           Acpi-State       4137          0
> >       Acpi-Namespace          8          0
> >          numa_policy        137          0
> >   ftrace_event_field         68          0
> >       pool_workqueue         25          0
> >      radix_tree_node       1694       7776
> >           task_group         21          0
> >            vmap_area        477          0
> >      kmalloc-rcl-512        473          0
> >      kmalloc-rcl-256        605          0
> >      kmalloc-rcl-192         43         16
> >      kmalloc-rcl-128          1         47
> >       kmalloc-rcl-96          3        229
> >       kmalloc-rcl-64          6        611
> >           kmalloc-8k         48         24
> >           kmalloc-4k        372         59
> >           kmalloc-2k        132         50
> >           kmalloc-1k        251         82
> >          kmalloc-512        360        150
> >          kmalloc-256        237          0
> >          kmalloc-192        298         24
> >          kmalloc-128        203         24
> >           kmalloc-96        112         24
> >           kmalloc-64        796         24
> >           kmalloc-32       1188         26
> >           kmalloc-16        555         25
> >            kmalloc-8         42         24
> >      kmem_cache_node         20          0
> >           kmem_cache         24          0
> 
> That's interesting, thanks. It does look fairly bimodal, except in
> some smaller caches. Which does make sense when you think about it: we
> focus on accounting consumers that are driven by userspace activity
> and big enough to actually matter in terms of cgroup footprint.
> 
> > > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > > In other words, memcg allocations will fragment root slab pages.
> > > 
> > > I understand the first part of this paragraph, but not the second. The
> > > objcgroup vectors will be freed when the slab pages get freed. But the
> > > partially filled slab pages can be reused by any types of allocations,
> > > surely? How would this cause the pages to fragment?
> > 
> > I mean the following: once you allocate a single accounted object
> > from the page, obj_cgroup vector is allocated and will be released only
> > with the slab page. We really really don't want to count how many accounted
> > objects are on the page and release obj_cgroup vector on reaching 0.
> > So even if all following allocations are root allocations, the overhead
> > will not go away with the uptime.
> > 
> > In other words, even a small percentage of accounted objects will
> > turn the whole cache into "accountable".
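
(To illustrate the point above with a toy model -- pure illustration in
Python, not the kernel data structures: a single accounted allocation
attaches a vector to the page, and the vector only goes away when the
page itself does, however many root allocations follow.)

class SlabPage:
    def __init__(self, objects_per_page):
        self.objcg_vec = None               # allocated lazily, freed only with the page
        self.nr_objects = objects_per_page

    def alloc(self, slot, objcg=None):
        if objcg is not None and self.objcg_vec is None:
            self.objcg_vec = [None] * self.nr_objects  # first accounted object on this page
        if self.objcg_vec is not None:
            self.objcg_vec[slot] = objcg    # stays None for root allocations

page = SlabPage(objects_per_page=32)
page.alloc(0, objcg='memcg A')              # one accounted object...
for slot in range(1, 32):
    page.alloc(slot)                        # ...then only root allocations
print(page.objcg_vec is not None)           # True: the per-page overhead persists
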
> 
> Correct. The worst case is where we have a large cache that has N
> objects per slab, but only ~1/N objects are accounted to a cgroup.
> 
> The question is whether this is common or even realistic. When would a
> cache be big, but only a small subset of its allocations would be
> attributable to specific cgroups?
> 
> On the less extreme overlapping cases, yeah there are fragmented
> obj_cgroup arrays, but there is also better slab packing. One is an
> array of pointers, the other is an array of larger objects. It would
> seem slab fragmentation has the potential to waste much more memory?
> 
> > > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > > >    I kinda like using the same scheme here and there.
> > > > > 
> > > > > It's hard to conclude anything based on this information alone. If
> > > > > it's truly expensive to merge them, then it warrants the additional
> > > > > complexity. But I don't understand the desire to share a design for
> > > > > two systems with sufficiently different constraints.
> > > > > 
> > > > > > Upsides?
> > > > > > 
> > > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > > >    utilized)
> > > > > 
> > > > > Right.
> > > > > 
> > > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > > >    but there isn't so much code left anyway.
> > > > > 
> > > > > There is a lot of complexity associated with the cache cloning that
> > > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > > 
> > > > Quite the opposite: the patchset removes all the complexity (or 90% of it),
> > > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > > 
> > > > Kmem_caches are created on demand on the first request (most likely during
> > > > the system start-up), and destroyed together with their root counterparts
> > > > (most likely never or on rmmod). First request means globally first request,
> > > > not a first request from a given memcg.
> > > > 
> > > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > > after creation just matches the lifetime of the root kmem caches.
> > > > 
> > > > The only reason to keep the async creation is that some kmem_caches
> > > > are created very early in the boot process, long before any cgroup
> > > > stuff is initialized.
> > > 
> > > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > > simpler after your patches. But I'm not talking about the delta, I'm
> > > trying to understand the end result.
> > > 
> > > And the truth is there is a decent chunk of code and tentacles spread
> > > throughout the slab/cgroup code to clone, destroy, and handle the
> > > split caches, as well as the branches/indirections on every cgrouped
> > > slab allocation.
> > 
> > Did you see the final code? It's fairly simple and there is really not
> > much of complexity left. If you don't think so, let's go into details,
> > because otherwise it's hard to say anything.
> 
> I have the patches applied to a local tree and am looking at the final
> code. But I can only repeat that "it's not too bad" simply isn't a
> good explanation for why the code is the way it is.
> 
> > With a such change which basically removes the current implementation
> > and replaces it with a new one, it's hard to keep the balance between
> > making commits self-contained and small, but also showing the whole picture.
> > I'm fully open to questions and generally want to make it simpler.
> > 
> > I've tried to separate some parts and get them merged before the main
> > thing, but they haven't been merged yet, so I have to include them
> > to keep the thing building.
> > 
> > Will a more-detailed design in the cover help?
> > Will writing a design doc to put into Documentation/ help?
> > Is it better to rearrange patches in a way to eliminate the current
> > implementation first and build from scratch?
> 
> It would help to have changelogs that actually describe how the new
> design is supposed to work, and why you made the decisions you made.
> 
> The changelog in this patch here sells the change as a reduction in
> complexity, without explaining why it stopped where it stopped. So
> naturally, if that's the declared goal, the first question is whether
> we can make it simpler.
> 
> Both the cover letter and the changelogs should focus less on what was
> there and how it was deleted, and more on how the code is supposed to
> work after the patches. How the components were designed and how they
> all work together.
> 
> As I said before, imagine somebody without any historical knowledge
> reading the code. They should be able to find out why you chose to
> have two sets of kmem caches. There is no explanation for it other
> than "there used to be more, so we cut it down to two".
> 
> > > > > And these two things are the primary aspects that make my head hurt
> > > > > trying to review this patch series.
> > > > > 
> > > > > > So IMO it's an interesting direction to explore, but not something
> > > > > > that necessarily has to be done in the context of this patchset.
> > > > > 
> > > > > I disagree. Instead of replacing the old coherent model and its
> > > > > complexities with a new coherent one, you are mixing the two. And I
> > > > > can barely understand the end result.
> > > > > 
> > > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > > whether the pages have an obj_cgroup array or not is *completely
> > > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > > start, nobody would have ever thought about doing it like this.
> > > > 
> > > > It's just not true. The whole point of having root- and memcg sets is
> > > > to be able to not look for a NULL pointer in the obj_cgroup vector when
> > > > releasing a root object. In other words, it allows to keep zero
> > > > overhead for root allocations. IMHO it's an important thing, and calling
> > > > it *completely insane* isn't the best way to communicate.
> > > 
> > > But you're trading it for the indirection of going through a separate
> > > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > > preferable trade-off to make?
> > 
> > Because it allows to keep zero memory and cpu overhead for root allocations.
> > I've no data showing that this overhead is small and acceptable in all cases.
> > I think keeping zero overhead for root allocations is more important
> > than having a single set of kmem caches.
> 
> In the kmem cache breakdown you provided above, there are 35887 pages
> allocated to root caches and 37300 pages allocated to cgroup caches.
> 
> Why are root allocations supposed to be more important? Aren't some of
> the hottest allocations tracked by cgroups? Look at fork():
> 
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> 
> Hard to think of much hotter allocations. They all have to suffer the
> additional branch and cache footprint of the auxiliary cgroup caches.
> 
> > > > I agree that it's an arguable question if we can tolerate some
> > > > additional overhead on root allocations to eliminate these additional
> > > > 10%, but I really don't think it's so obvious that even discussing
> > > > it is insane.
> > > 
> > > Well that's exactly my point.
> > 
> > Ok, what's the acceptable performance penalty?
> > Is adding 20% on the free path acceptable, for example?
> > Or adding 3% of slab memory?
> 
> I find offhand replies like these very jarring.
> 
> There is a legitimate design question: Why are you using a separate
> set of kmem caches for the cgroup allocations, citing the additional
> complexity over having one set? And your reply was mostly handwaving.

Johannes,

I posted patches and numbers that show that the patchset improves
a fundamental kernel characteristic (slab utilization) by a meaningful margin.
It has been confirmed by others, who kindly tested it on their machines.

Surprisingly, during this and previous review sessions, I didn't hear
a single good word from you, only a constant stream of blame: I do not
answer questions, I do not write perfect code, I fail to provide
satisfying answers, I'm waving hands, saying insane things, etc.
At any minimal disagreement you basically raise the tone.

I find this style of discussion irritating and unproductive.
So I'm taking a break and starting to work on the next version.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-31 22:24     ` Roman Gushchin
@ 2020-02-12  5:21       ` Bharata B Rao
  2020-02-12 20:42         ` Roman Gushchin
  0 siblings, 1 reply; 84+ messages in thread
From: Bharata B Rao @ 2020-02-12  5:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Fri, Jan 31, 2020 at 10:24:58PM +0000, Roman Gushchin wrote:
> On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> 
> Btw, I've checked that the change like you've done above fixes the problem.
> The script works for me both on current upstream and new_slab.2 branch.
> 
> Are you sure that in your case there is some kernel memory charged to that
> cgroup? Please note, that in the current implementation kmem_caches are created
> on demand, so the accounting is effectively enabled with some delay.

I do see kmem getting charged.

# cat /sys/fs/cgroup/memory/1/memory.kmem.usage_in_bytes /sys/fs/cgroup/memory/1/memory.usage_in_bytes
182910976
4515627008
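
For reference, something like the following is enough to generate such
charges from inside the cgroup; a minimal, untested sketch (assuming
the same v1 hierarchy as above and sufficient privileges):

import os

CG = '/sys/fs/cgroup/memory/1'            # same cgroup as in the output above

def kmem_usage():
    with open(os.path.join(CG, 'memory.kmem.usage_in_bytes')) as f:
        return int(f.read())

# Move this process into the cgroup, then churn dentries/inodes so the
# resulting slab allocations are charged to it.
with open(os.path.join(CG, 'cgroup.procs'), 'w') as f:
    f.write(str(os.getpid()))

before = kmem_usage()
for i in range(10000):
    path = '/tmp/kmem-poke-%d' % i
    open(path, 'w').close()
    os.unlink(path)
print(before, kmem_usage())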

> Below is an updated version of the patch to use:

I see the below failure with this updated version:

# ./tools/cgroup/slabinfo-new.py /sys/fs/cgroup/memory/1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
    runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./tools/cgroup/slabinfo-new.py", line 158, in <module>
    main()
  File "./tools/cgroup/slabinfo-new.py", line 153, in main
    memcg.kmem_caches.address_of_(),
AttributeError: 'struct mem_cgroup' has no member 'kmem_caches'

> +
> +def main():
> +    parser = argparse.ArgumentParser(description=DESC,
> +                                     formatter_class=
> +                                     argparse.RawTextHelpFormatter)
> +    parser.add_argument('cgroup', metavar='CGROUP',
> +                        help='Target memory cgroup')
> +    args = parser.parse_args()
> +
> +    try:
> +        cgroup_id = stat(args.cgroup).st_ino
> +        find_memcg_ids()
> +        memcg = MEMCGS[cgroup_id]
> +    except KeyError:
> +        err('Can\'t find the memory cgroup')
> +
> +    cfg = detect_kernel_config()
> +
> +    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
> +          ' : tunables <limit> <batchcount> <sharedfactor>'
> +          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
> +
> +    for s in list_for_each_entry('struct kmem_cache',
> +                                 memcg.kmem_caches.address_of_(),
> +                                 'memcg_params.kmem_caches_node'):

Are you sure this is the right version? In the previous version
you had the if-else branch that handled shared_slab_pages and the old
scheme separately.

Regards,
Bharata.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-02-12  5:21       ` Bharata B Rao
@ 2020-02-12 20:42         ` Roman Gushchin
  0 siblings, 0 replies; 84+ messages in thread
From: Roman Gushchin @ 2020-02-12 20:42 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Wed, Feb 12, 2020 at 10:51:24AM +0530, Bharata B Rao wrote:
> On Fri, Jan 31, 2020 at 10:24:58PM +0000, Roman Gushchin wrote:
> > On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > 
> > Btw, I've checked that the change like you've done above fixes the problem.
> > The script works for me both on current upstream and new_slab.2 branch.
> > 
> > Are you sure that in your case there is some kernel memory charged to that
> > cgroup? Please note that in the current implementation kmem_caches are created
> > on demand, so the accounting is effectively enabled with some delay.
> 
> I do see kmem getting charged.
> 
> # cat /sys/fs/cgroup/memory/1/memory.kmem.usage_in_bytes /sys/fs/cgroup/memory/1/memory.usage_in_bytes
> 182910976
> 4515627008

Great.

> 
> > Below is an updated version of the patch to use:
> 
> I see the below failure with this updated version:

Are you sure that drgn is picking the right symbols?
I had a similar transient issue during my work, when drgn was actually
using symbols from a different kernel.
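
Something along these lines (a quick sketch for the drgn CLI, which
predefines 'prog' for the running kernel; the file name is arbitrary)
should show which kernel image and struct layout drgn actually picked up:

# run as root with: drgn check-layout.py
print(prog['linux_banner'].string_().decode().strip())

members = [m.name for m in prog.type('struct mem_cgroup').members]
# True only on kernels that still have the per-memcg kmem_caches list
print('kmem_caches' in members)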

> 
> # ./tools/cgroup/slabinfo-new.py /sys/fs/cgroup/memory/1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo-new.py", line 158, in <module>
>     main()
>   File "./tools/cgroup/slabinfo-new.py", line 153, in main
>     memcg.kmem_caches.address_of_(),
> AttributeError: 'struct mem_cgroup' has no member 'kmem_caches'
> 
> > +
> > +def main():
> > +    parser = argparse.ArgumentParser(description=DESC,
> > +                                     formatter_class=
> > +                                     argparse.RawTextHelpFormatter)
> > +    parser.add_argument('cgroup', metavar='CGROUP',
> > +                        help='Target memory cgroup')
> > +    args = parser.parse_args()
> > +
> > +    try:
> > +        cgroup_id = stat(args.cgroup).st_ino
> > +        find_memcg_ids()
> > +        memcg = MEMCGS[cgroup_id]
> > +    except KeyError:
> > +        err('Can\'t find the memory cgroup')
> > +
> > +    cfg = detect_kernel_config()
> > +
> > +    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
> > +          ' : tunables <limit> <batchcount> <sharedfactor>'
> > +          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
> > +
> > +    for s in list_for_each_entry('struct kmem_cache',
> > +                                 memcg.kmem_caches.address_of_(),
> > +                                 'memcg_params.kmem_caches_node'):
> 
> Are you sure this is the right version? In the previous version
> you had the if-else branch that handled shared_slab_pages and the old
> scheme separately.

Which one are you referring to?

As in my tree there are two patches:
fa490da39afb tools/cgroup: add slabinfo.py tool
e3bee81aab44 tools/cgroup: make slabinfo.py compatible with new slab controller

The second one adds the if clause you're probably referring to.

Thanks!

Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-01-30  2:41   ` Roman Gushchin
@ 2020-08-12 23:16     ` Pavel Tatashin
  2020-08-12 23:18       ` Pavel Tatashin
  2020-08-13  0:04       ` Roman Gushchin
  0 siblings, 2 replies; 84+ messages in thread
From: Pavel Tatashin @ 2020-08-12 23:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

Guys,

There is a convoluted deadlock that I just root caused, and that is
fixed by this work (at least based on my code inspection it appears to
be fixed); but the deadlock exists in older and stable kernels, and I
am not sure whether to create a separate patch for it, or backport
this whole thing.

Thread #1: Hot-removes memory
device_offline
  memory_subsys_offline
    offline_pages
      __offline_pages
        mem_hotplug_lock <- write access
      waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
migrate it.

Thread #2: css killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: crashing userland program
do_coredump
  elf_core_dump
      get_dump_page() -> get page with pfn#9e5113, and increment refcnt
      dump_emit
        __kernel_write
          __vfs_write
            new_sync_write
              pipe_write
                pipe_wait   -> waits for Thread #4 systemd-coredump to
read the pipe

Thread #4: systemd-coredump
ksys_read
  vfs_read
    __vfs_read
      seq_read
        proc_single_show
          proc_cgroup_show
            cgroup_mutex -> waits from Thread #2 for this lock.

In Summary:
Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
waits for Thread#1 for mem_hotplug_lock rwlock.
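
To make the cycle easier to see, here it is as a tiny wait-for graph
(illustrative Python only; the node names are shorthand for the stacks
above):

# An edge A -> B means "A waits on something held or consumed by B".
waits_for = {
    'T1 offline_pages':      'T3 coredump',            # T3 holds the page ref T1 needs
    'T3 coredump':           'T4 systemd-coredump',    # T4 must drain the pipe T3 writes
    'T4 systemd-coredump':   'T2 css_killed_work_fn',  # T2 holds cgroup_mutex
    'T2 css_killed_work_fn': 'T1 offline_pages',       # T1 holds mem_hotplug_lock for write
}

def find_cycle(graph, start):
    """Follow wait-for edges until a node repeats: that path is the deadlock."""
    path, node = [], start
    while node not in path:
        path.append(node)
        node = graph[node]
    return path[path.index(node):] + [node]

print(' -> '.join(find_cycle(waits_for, 'T1 offline_pages')))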

This work appears to fix this deadlock because cgroup_mutex is no
longer taken before mem_hotplug_lock (unless I am missing it), as it
removes memcg_deactivate_kmem_caches.

Thank you,
Pasha

On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > The existing cgroup slab memory controller is based on the idea of
> > > replicating slab allocator internals for each memory cgroup.
> > > This approach promises a low memory overhead (one pointer per page),
> > > and isn't adding too much code on hot allocation and release paths.
> > > But is has a very serious flaw: it leads to a low slab utilization.
> > >
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > >
> > > The real reason why the existing design leads to a low slab utilization
> > > is simple: slab pages are used exclusively by one memory cgroup.
> > > If there are only few allocations of certain size made by a cgroup,
> > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > deleted, or the cgroup contains a single-threaded application which is
> > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > in all these cases the resulting slab utilization is very low.
> > > If kmem accounting is off, the kernel is able to use free space
> > > on slab pages for other allocations.
> > >
> > > Arguably it wasn't an issue back to days when the kmem controller was
> > > introduced and was an opt-in feature, which had to be turned on
> > > individually for each memory cgroup. But now it's turned on by default
> > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > create a large number of cgroups.
> > >
> > > This patchset provides a new implementation of the slab memory controller,
> > > which aims to reach a much better slab utilization by sharing slab pages
> > > between multiple memory cgroups. Below is the short description of the new
> > > design (more details in commit messages).
> > >
> > > Accounting is performed per-object instead of per-page. Slab-related
> > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > with rounding up and remembering leftovers.
> > >
> > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > working, instead of saving a pointer to the memory cgroup directly an
> > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > easily changed to the parent) with a built-in reference counter. This scheme
> > > allows to reparent all allocated objects without walking them over and
> > > changing memcg pointer to the parent.
> > >
> > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > two global sets are used: the root set for non-accounted and root-cgroup
> > > allocations and the second set for all other allocations. This allows to
> > > simplify the lifetime management of individual kmem_caches: they are
> > > destroyed with root counterparts. It allows to remove a good amount of code
> > > and make things generally simpler.
> > >
> > > The patchset* has been tested on a number of different workloads in our
> > > production. In all cases it saved significant amount of memory, measured
> > > from high hundreds of MBs to single GBs per host. On average, the size
> > > of slab memory has been reduced by 35-45%.
> >
> > Here are some numbers from multiple runs of sysbench and kernel compilation
> > with this patchset on a 10 core POWER8 host:
> >
> > ==========================================================================
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > of a mem cgroup (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    15859712        4456448         72
> > memory.usage_in_bytes         337510400       335806464       .5
> > Slab: (kB)                    814336          607296          25
> >
> > memory.kmem.usage_in_bytes    16187392        4653056         71
> > memory.usage_in_bytes         318832640       300154880       5
> > Slab: (kB)                    789888          559744          29
> > --------------------------------------------------------------------------
> >
> >
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > done from bash that is in a memory cgroup. (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    338493440       231931904       31
> > memory.usage_in_bytes         7368015872      6275923968      15
> > Slab: (kB)                    1139072         785408          31
> >
> > memory.kmem.usage_in_bytes    341835776       236453888       30
> > memory.usage_in_bytes         6540427264      6072893440      7
> > Slab: (kB)                    1074304         761280          29
> >
> > memory.kmem.usage_in_bytes    340525056       233570304       31
> > memory.usage_in_bytes         6406209536      6177357824      3
> > Slab: (kB)                    1244288         739712          40
> > --------------------------------------------------------------------------
> >
> > Slab consumption right after boot
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > Slab: (kB)                    821888          583424          29
> > ==========================================================================
> >
> > Summary:
> >
> > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > around 70% and 30% reduction consistently.
> >
> > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > kernel compilation.
> >
> > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > same is seen right after boot too.
>
> That's just perfect!
>
> memory.usage_in_bytes was most likely the same because the freed space
> was taken by pagecache.
>
> Thank you very much for testing!
>
> Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-12 23:16     ` Pavel Tatashin
@ 2020-08-12 23:18       ` Pavel Tatashin
  2020-08-13  0:04       ` Roman Gushchin
  1 sibling, 0 replies; 84+ messages in thread
From: Pavel Tatashin @ 2020-08-12 23:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

BTW, I replied to a wrong version of this work. I intended to reply to
version 7:
https://lore.kernel.org/lkml/20200623174037.3951353-1-guro@fb.com/

Nevertheless, the problem is the same.

Thank you,
Pasha

On Wed, Aug 12, 2020 at 7:16 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Guys,
>
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.
>
> Thread #1: Hot-removes memory
> device_offline
>   memory_subsys_offline
>     offline_pages
>       __offline_pages
>         mem_hotplug_lock <- write access
>       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
>
> Thread #2: css killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: crashing userland program
> do_coredump
>   elf_core_dump
>       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>       dump_emit
>         __kernel_write
>           __vfs_write
>             new_sync_write
>               pipe_write
>                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> read the pipe
>
> Thread #4: systemd-coredump
> ksys_read
>   vfs_read
>     __vfs_read
>       seq_read
>         proc_single_show
>           proc_cgroup_show
>             cgroup_mutex -> waits from Thread #2 for this lock.
>
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
>
> This work appears to fix this deadlock because cgroup_mutex is no
> longer taken before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
>
> Thank you,
> Pasha
>
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > memory.usage_in_bytes         337510400       335806464       .5
> > > Slab: (kB)                    814336          607296          25
> > >
> > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > memory.usage_in_bytes         318832640       300154880       5
> > > Slab: (kB)                    789888          559744          29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > memory.usage_in_bytes         7368015872      6275923968      15
> > > Slab: (kB)                    1139072         785408          31
> > >
> > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > memory.usage_in_bytes         6540427264      6072893440      7
> > > Slab: (kB)                    1074304         761280          29
> > >
> > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > memory.usage_in_bytes         6406209536      6177357824      3
> > > Slab: (kB)                    1244288         739712          40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB)                    821888          583424          29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-12 23:16     ` Pavel Tatashin
  2020-08-12 23:18       ` Pavel Tatashin
@ 2020-08-13  0:04       ` Roman Gushchin
  2020-08-13  0:31         ` Pavel Tatashin
  1 sibling, 1 reply; 84+ messages in thread
From: Roman Gushchin @ 2020-08-13  0:04 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> Guys,
> 
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.

Hi Pavel,

wow, it's quite a complicated deadlock. Thank you for providing
a perfect analysis!

Unfortunately, backporting the whole new slab controller isn't an option:
it's way too big and invasive.
Do you already have a standalone fix?

Thanks!


> 
> Thread #1: Hot-removes memory
> device_offline
>   memory_subsys_offline
>     offline_pages
>       __offline_pages
>         mem_hotplug_lock <- write access
>       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
> 
> Thread #2: css killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: crashing userland program
> do_coredump
>   elf_core_dump
>       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>       dump_emit
>         __kernel_write
>           __vfs_write
>             new_sync_write
>               pipe_write
>                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> read the pipe
> 
> Thread #4: systemd-coredump
> ksys_read
>   vfs_read
>     __vfs_read
>       seq_read
>         proc_single_show
>           proc_cgroup_show
>             cgroup_mutex -> waits from Thread #2 for this lock.

> 
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
> 
> This work appears to fix this deadlock because cgroup_mutex is no
> longer taken before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
> 
> Thank you,
> Pasha
> 
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > memory.usage_in_bytes         337510400       335806464       .5
> > > Slab: (kB)                    814336          607296          25
> > >
> > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > memory.usage_in_bytes         318832640       300154880       5
> > > Slab: (kB)                    789888          559744          29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > memory.usage_in_bytes         7368015872      6275923968      15
> > > Slab: (kB)                    1139072         785408          31
> > >
> > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > memory.usage_in_bytes         6540427264      6072893440      7
> > > Slab: (kB)                    1074304         761280          29
> > >
> > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > memory.usage_in_bytes         6406209536      6177357824      3
> > > Slab: (kB)                    1244288         739712          40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB)                    821888          583424          29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-13  0:04       ` Roman Gushchin
@ 2020-08-13  0:31         ` Pavel Tatashin
  2020-08-28 16:47           ` Pavel Tatashin
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-08-13  0:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > Guys,
> >
> > There is a convoluted deadlock that I just root caused, and that is
> > fixed by this work (at least based on my code inspection it appears to
> > be fixed); but the deadlock exists in older and stable kernels, and I
> > am not sure whether to create a separate patch for it, or backport
> > this whole thing.
>

Hi Roman,

> Hi Pavel,
>
> wow, it's quite a complicated deadlock. Thank you for providing
> a perfect analysis!

Thank you, it indeed took me a while to fully grasp the deadlock.

>
> Unfortunately, backporting the whole new slab controller isn't an option:
> it's way too big and invasive.

This is what I thought as well; that's why I want to figure out
the best way forward.

> Do you already have a standalone fix?

Not yet, I do not have a standalone fix. I suspect the best fix would
be to address the css_killed_work_fn() stack so we never have
cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reversing
the order would work. If you have suggestions, since you worked on this
code recently, please let me know.

Thank you,
Pasha

>
> Thanks!
>
>
> >
> > Thread #1: Hot-removes memory
> > device_offline
> >   memory_subsys_offline
> >     offline_pages
> >       __offline_pages
> >         mem_hotplug_lock <- write access
> >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > migrate it.
> >
> > Thread #2: ccs killer kthread
> >    css_killed_work_fn
> >      cgroup_mutex  <- Grab this Mutex
> >      mem_cgroup_css_offline
> >        memcg_offline_kmem.part
> >           memcg_deactivate_kmem_caches
> >             get_online_mems
> >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: crashing userland program
> > do_coredump
> >   elf_core_dump
> >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> >       dump_emit
> >         __kernel_write
> >           __vfs_write
> >             new_sync_write
> >               pipe_write
> >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> > read the pipe
> >
> > Thread #4: systemd-coredump
> > ksys_read
> >   vfs_read
> >     __vfs_read
> >       seq_read
> >         proc_single_show
> >           proc_cgroup_show
> >             cgroup_mutex -> waits from Thread #2 for this lock.
>
> >
> > In Summary:
> > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > waits for Thread#1 for mem_hotplug_lock rwlock.
> >
> > This work appears to fix this deadlock because cgroup_mutex is not
> > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > removes memcg_deactivate_kmem_caches.
> >
> > Thank you,
> > Pasha
> >
> > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > replicating slab allocator internals for each memory cgroup.
> > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > >
> > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > a number of machines running different production workloads. In most
> > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > per page.
> > > > >
> > > > > The real reason why the existing design leads to a low slab utilization
> > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > in all these cases the resulting slab utilization is very low.
> > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > on slab pages for other allocations.
> > > > >
> > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > create a large number of cgroups.
> > > > >
> > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > design (more details in commit messages).
> > > > >
> > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > with rounding up and remembering leftovers.
> > > > >
> > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > allows to reparent all allocated objects without walking them over and
> > > > > changing memcg pointer to the parent.
> > > > >
> > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > allocations and the second set for all other allocations. This allows to
> > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > and make things generally simpler.
> > > > >
> > > > > The patchset* has been tested on a number of different workloads in our
> > > > > production. In all cases it saved significant amount of memory, measured
> > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > of slab memory has been reduced by 35-45%.
> > > >
> > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > with this patchset on a 10 core POWER8 host:
> > > >
> > > > ==========================================================================
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > of a mem cgroup (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > > memory.usage_in_bytes         337510400       335806464       .5
> > > > Slab: (kB)                    814336          607296          25
> > > >
> > > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > > memory.usage_in_bytes         318832640       300154880       5
> > > > Slab: (kB)                    789888          559744          29
> > > > --------------------------------------------------------------------------
> > > >
> > > >
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > > memory.usage_in_bytes         7368015872      6275923968      15
> > > > Slab: (kB)                    1139072         785408          31
> > > >
> > > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > > memory.usage_in_bytes         6540427264      6072893440      7
> > > > Slab: (kB)                    1074304         761280          29
> > > >
> > > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > > memory.usage_in_bytes         6406209536      6177357824      3
> > > > Slab: (kB)                    1244288         739712          40
> > > > --------------------------------------------------------------------------
> > > >
> > > > Slab consumption right after boot
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > Slab: (kB)                    821888          583424          29
> > > > ==========================================================================
> > > >
> > > > Summary:
> > > >
> > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > > around 70% and 30% reduction consistently.
> > > >
> > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > kernel compilation.
> > > >
> > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > same is seen right after boot too.
> > >
> > > That's just perfect!
> > >
> > > memory.usage_in_bytes was most likely the same because the freed space
> > > was taken by pagecache.
> > >
> > > Thank you very much for testing!
> > >
> > > Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-13  0:31         ` Pavel Tatashin
@ 2020-08-28 16:47           ` Pavel Tatashin
  2020-09-01  5:28             ` Bharata B Rao
  2020-09-02  9:53             ` Vlastimil Babka
  0 siblings, 2 replies; 84+ messages in thread
From: Pavel Tatashin @ 2020-08-28 16:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

There appears to be another problem that is related to the
cgroup_mutex -> mem_hotplug_lock deadlock described above.

In the original deadlock that I described, the workaround is to switch
the crash dump from piping to the traditional save-to-file method.
However, after trying this workaround, I still observed hardware
watchdog resets during machine shutdown.

The new problem occurs for the following reason: upon shutdown systemd
calls a service that hot-removes memory, and if hot-removing fails for
some reason, systemd kills that service after a timeout. However, systemd
is never able to kill the service, and we get a hardware reset caused by
the watchdog, or a hang during shutdown:

Thread #1: memory hot-remove systemd service
Loops indefinitely as long as there is still something to be migrated;
the loop never terminates on its own. However, it can be terminated by a
signal from systemd after the timeout.
__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
                  # Returns 0, meaning there is nothing available to
                  # migrate, no page is PageLRU(page)
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
                  # Returns -EBUSY, meaning there is at least one PFN that
                  # still has to be migrated.
      } while (ret);

Thread #2: css killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: systemd
ksys_read
 vfs_read
   __vfs_read
     seq_read
       proc_single_show
         proc_cgroup_show
           mutex_lock -> wait for cgroup_mutex that is owned by Thread #2

Thus, thread #3 (systemd) is stuck and unable to deliver the timeout
signal to thread #1.
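
To make the circular dependency easier to see, here is a minimal
user-space analogue of the three-way wait (plain pthreads, not kernel
code; the rwlock, mutex and thread names are only stand-ins for
mem_hotplug_lock, cgroup_mutex and the three kernel paths above).
Running it simply hangs, which is exactly the failure mode:

/* deadlock_analogy.c -- illustrative sketch only, not kernel code.
 * Build: gcc -O2 -pthread deadlock_analogy.c -o deadlock_analogy
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t hotplug = PTHREAD_RWLOCK_INITIALIZER; /* ~mem_hotplug_lock */
static pthread_mutex_t cgroup = PTHREAD_MUTEX_INITIALIZER;    /* ~cgroup_mutex */
static atomic_int cancel;       /* ~the timeout signal systemd would send */

/* Thread #1: "offliner" -- takes the rwlock for write, loops until cancelled. */
static void *offliner(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&hotplug);
	while (!atomic_load(&cancel))
		usleep(1000);
	pthread_rwlock_unlock(&hotplug);
	return NULL;
}

/* Thread #2: "css killer" -- takes the mutex, then blocks on the rwlock (read). */
static void *css_killer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cgroup);
	pthread_rwlock_rdlock(&hotplug);   /* stuck behind the writer */
	pthread_rwlock_unlock(&hotplug);
	pthread_mutex_unlock(&cgroup);
	return NULL;
}

/* Thread #3: "systemd" -- must take the mutex before it can cancel thread #1. */
static void *systemd_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cgroup);       /* stuck: thread #2 owns it */
	atomic_store(&cancel, 1);          /* never reached */
	pthread_mutex_unlock(&cgroup);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2, t3;

	pthread_create(&t1, NULL, offliner, NULL);
	sleep(1);                          /* let #1 take the write lock */
	pthread_create(&t2, NULL, css_killer, NULL);
	sleep(1);                          /* let #2 take cgroup and block */
	pthread_create(&t3, NULL, systemd_thread, NULL);

	pthread_join(t3, NULL);            /* hangs: three-way circular wait */
	pthread_join(t2, NULL);
	pthread_join(t1, NULL);
	puts("no hang -- not expected");
	return 0;
}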

The proper fix for both problems is to avoid the cgroup_mutex ->
mem_hotplug_lock ordering, which was recently eliminated in mainline but
is still present in all stable branches. Unfortunately, I do not see a
simple way to remove mem_hotplug_lock from
memcg_deactivate_kmem_caches() without taking Roman's series, which is
too big for stable.

Thanks,
Pasha

On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > > Guys,
> > >
> > > There is a convoluted deadlock that I just root caused, and that is
> > > fixed by this work (at least based on my code inspection it appears to
> > > be fixed); but the deadlock exists in older and stable kernels, and I
> > > am not sure whether to create a separate patch for it, or backport
> > > this whole thing.
> >
>
> Hi Roman,
>
> > Hi Pavel,
> >
> > wow, it's a quite complicated deadlock. Thank you for providing
> > a perfect analysis!
>
> Thank you, it indeed took me a while to fully grasp the deadlock.
>
> >
> > Unfortunately, backporting the whole new slab controller isn't an option:
> > it's way too big and invasive.
>
> This is what I thought as well, this is why I want to figure out what
> is the best way forward.
>
> > Do you already have a standalone fix?
>
> Not yet, I do not have a standalone fix. I suspect the best fix would
> be to address fix css_killed_work_fn() stack so we never have:
> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
> the order would work. If you have suggestions since you worked on this
> code recently, please let me know.
>
> Thank you,
> Pasha
>
> >
> > Thanks!
> >
> >
> > >
> > > Thread #1: Hot-removes memory
> > > device_offline
> > >   memory_subsys_offline
> > >     offline_pages
> > >       __offline_pages
> > >         mem_hotplug_lock <- write access
> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > > migrate it.
> > >
> > > Thread #2: ccs killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: crashing userland program
> > > do_coredump
> > >   elf_core_dump
> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> > >       dump_emit
> > >         __kernel_write
> > >           __vfs_write
> > >             new_sync_write
> > >               pipe_write
> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> > > read the pipe
> > >
> > > Thread #4: systemd-coredump
> > > ksys_read
> > >   vfs_read
> > >     __vfs_read
> > >       seq_read
> > >         proc_single_show
> > >           proc_cgroup_show
> > >             cgroup_mutex -> waits from Thread #2 for this lock.
> >
> > >
> > > In Summary:
> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > > waits for Thread#1 for mem_hotplug_lock rwlock.
> > >
> > > This work appears to fix this deadlock because cgroup_mutex is not
> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > > removes memcg_deactivate_kmem_caches.
> > >
> > > Thank you,
> > > Pasha
> > >
> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > > replicating slab allocator internals for each memory cgroup.
> > > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > > >
> > > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > > a number of machines running different production workloads. In most
> > > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > > per page.
> > > > > >
> > > > > > The real reason why the existing design leads to a low slab utilization
> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > > in all these cases the resulting slab utilization is very low.
> > > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > > on slab pages for other allocations.
> > > > > >
> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > > create a large number of cgroups.
> > > > > >
> > > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > > design (more details in commit messages).
> > > > > >
> > > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > > with rounding up and remembering leftovers.
> > > > > >
> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > > allows to reparent all allocated objects without walking them over and
> > > > > > changing memcg pointer to the parent.
> > > > > >
> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > > allocations and the second set for all other allocations. This allows to
> > > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > > and make things generally simpler.
> > > > > >
> > > > > > The patchset* has been tested on a number of different workloads in our
> > > > > > production. In all cases it saved significant amount of memory, measured
> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > > of slab memory has been reduced by 35-45%.
> > > > >
> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > > with this patchset on a 10 core POWER8 host:
> > > > >
> > > > > ==========================================================================
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > > of a mem cgroup (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > > > memory.usage_in_bytes         337510400       335806464       .5
> > > > > Slab: (kB)                    814336          607296          25
> > > > >
> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > > > memory.usage_in_bytes         318832640       300154880       5
> > > > > Slab: (kB)                    789888          559744          29
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > >
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > > > memory.usage_in_bytes         7368015872      6275923968      15
> > > > > Slab: (kB)                    1139072         785408          31
> > > > >
> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > > > memory.usage_in_bytes         6540427264      6072893440      7
> > > > > Slab: (kB)                    1074304         761280          29
> > > > >
> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > > > memory.usage_in_bytes         6406209536      6177357824      3
> > > > > Slab: (kB)                    1244288         739712          40
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > > Slab consumption right after boot
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > Slab: (kB)                    821888          583424          29
> > > > > ==========================================================================
> > > > >
> > > > > Summary:
> > > > >
> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > > > around 70% and 30% reduction consistently.
> > > > >
> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > > kernel compilation.
> > > > >
> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > > same is seen right after boot too.
> > > >
> > > > That's just perfect!
> > > >
> > > > memory.usage_in_bytes was most likely the same because the freed space
> > > > was taken by pagecache.
> > > >
> > > > Thank you very much for testing!
> > > >
> > > > Roman


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-28 16:47           ` Pavel Tatashin
@ 2020-09-01  5:28             ` Bharata B Rao
  2020-09-01 12:52               ` Pavel Tatashin
  2020-09-02  9:53             ` Vlastimil Babka
  1 sibling, 1 reply; 84+ messages in thread
From: Bharata B Rao @ 2020-09-01  5:28 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from the hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder whether infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do. Earlier we had
a few other exit possibilities in this path (like a maximum retry count),
but those were removed by these commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or is the user-space test expected to induce a signal back-off when the
unplug doesn't complete within a reasonable amount of time?
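
For concreteness, the kind of user-space back-off I have in mind could
look roughly like the sketch below. This is purely illustrative: the
memory block number is made up, and it relies on the offline loop being
terminable by a pending signal, as Pavel notes above for the systemd
timeout case.

/* offline_with_timeout.c -- illustrative sketch only.  Tries to offline
 * one (made-up) memory block via sysfs and backs off after 30 seconds by
 * letting SIGALRM interrupt the write().
 */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig)
{
	(void)sig;                 /* nothing to do: just interrupt the write() */
}

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory42/online";
	struct sigaction sa;
	ssize_t n;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;  /* no SA_RESTART, so write() can return EINTR */
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, NULL);
	alarm(30);                 /* back off if offlining takes longer than this */

	n = write(fd, "0", 1);     /* request offline of this block */
	if (n < 0)
		fprintf(stderr, "offline failed or timed out: %s\n",
			strerror(errno));
	close(fd);
	return n == 1 ? 0 : 1;
}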

Regards,
Bharata.



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-01  5:28             ` Bharata B Rao
@ 2020-09-01 12:52               ` Pavel Tatashin
  2020-09-02  6:23                 ` Bharata B Rao
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-01 12:52 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);
> >
> > Thread #2: ccs killer kthread
> >    css_killed_work_fn
> >      cgroup_mutex  <- Grab this Mutex
> >      mem_cgroup_css_offline
> >        memcg_offline_kmem.part
> >           memcg_deactivate_kmem_caches
> >             get_online_mems
> >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: systemd
> > ksys_read
> >  vfs_read
> >    __vfs_read
> >      seq_read
> >        proc_single_show
> >          proc_cgroup_show
> >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> >
> > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > to thread #1.
> >
> > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > still present in all stable branches. Unfortunately, I do not see a
> > simple fix in how to remove mem_hotplug_lock from
> > memcg_deactivate_kmem_caches without using Roman's series that is too
> > big for stable.
>
> We too are seeing this on Power systems when stress-testing memory
> hotplug, but with the following call trace (from hung task timer)
> instead of Thread #2 above:
>
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> get_online_mems
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
> ret_from_kernel_thread
>
> While I understand that Roman's new slab controller patchset will fix
> this, I also wonder if infinitely looping in the memory unplug path
> with mem_hotplug_lock held is the right thing to do? Earlier we had
> a few other exit possibilities in this path (like max retries etc)
> but those were removed by commits:
>
> 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
>
> Or, is the user-space test is expected to induce a signal back-off when
> unplug doesn't complete within a reasonable amount of time?

Hi Bharata,

Thank you for your input; it looks like you are experiencing the same
problems that I observed.

What I found is that our machines did not complete the hot-remove
within the given time because of this bug:
https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com

Could you please try it and see if that helps for your case?

Thank you,
Pasha


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-01 12:52               ` Pavel Tatashin
@ 2020-09-02  6:23                 ` Bharata B Rao
  2020-09-02 12:34                   ` Pavel Tatashin
  0 siblings, 1 reply; 84+ messages in thread
From: Bharata B Rao @ 2020-09-02  6:23 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > replace crash dump from piping to Linux traditional save to files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine  shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason systemd kills that service after timeout. However, systemd
> > > is never able to kill the service, and we get hardware reset caused by
> > > watchdog or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
> > >
> > > Thread #2: ccs killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > >  vfs_read
> > >    __vfs_read
> > >      seq_read
> > >        proc_single_show
> > >          proc_cgroup_show
> > >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> > >
> > > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > > to thread #1.
> > >
> > > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple fix in how to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series that is too
> > > big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder if infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do? Earlier we had
> > a few other exit possibilities in this path (like max retries etc)
> > but those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test is expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
> 
> Hi Bharata,
> 
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
> 
> What I found is that the reason why our machines did not complete
> hot-remove within the given time is because of this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com
> 
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix you are proposing,
so I might be seeing some other issue, which I will debug further.

So it looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-28 16:47           ` Pavel Tatashin
  2020-09-01  5:28             ` Bharata B Rao
@ 2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
                                 ` (2 more replies)
  1 sibling, 3 replies; 84+ messages in thread
From: Vlastimil Babka @ 2020-09-02  9:53 UTC (permalink / raw)
  To: Pavel Tatashin, Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand, Michal Hocko

On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for

Why is that hotremove even needed if we're shutting down? Are there any
(virtualization?) platforms where it makes some difference over plain
shutdown/restart?

> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.
> 
> Thanks,
> Pasha
> 
> On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>> >
>> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
>> > > Guys,
>> > >
>> > > There is a convoluted deadlock that I just root caused, and that is
>> > > fixed by this work (at least based on my code inspection it appears to
>> > > be fixed); but the deadlock exists in older and stable kernels, and I
>> > > am not sure whether to create a separate patch for it, or backport
>> > > this whole thing.
>> >
>>
>> Hi Roman,
>>
>> > Hi Pavel,
>> >
>> > wow, it's a quite complicated deadlock. Thank you for providing
>> > a perfect analysis!
>>
>> Thank you, it indeed took me a while to fully grasp the deadlock.
>>
>> >
>> > Unfortunately, backporting the whole new slab controller isn't an option:
>> > it's way too big and invasive.
>>
>> This is what I thought as well, this is why I want to figure out what
>> is the best way forward.
>>
>> > Do you already have a standalone fix?
>>
>> Not yet, I do not have a standalone fix. I suspect the best fix would
>> be to address fix css_killed_work_fn() stack so we never have:
>> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
>> the order would work. If you have suggestions since you worked on this
>> code recently, please let me know.
>>
>> Thank you,
>> Pasha
>>
>> >
>> > Thanks!
>> >
>> >
>> > >
>> > > Thread #1: Hot-removes memory
>> > > device_offline
>> > >   memory_subsys_offline
>> > >     offline_pages
>> > >       __offline_pages
>> > >         mem_hotplug_lock <- write access
>> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
>> > > migrate it.
>> > >
>> > > Thread #2: ccs killer kthread
>> > >    css_killed_work_fn
>> > >      cgroup_mutex  <- Grab this Mutex
>> > >      mem_cgroup_css_offline
>> > >        memcg_offline_kmem.part
>> > >           memcg_deactivate_kmem_caches
>> > >             get_online_mems
>> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>> > >
>> > > Thread #3: crashing userland program
>> > > do_coredump
>> > >   elf_core_dump
>> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>> > >       dump_emit
>> > >         __kernel_write
>> > >           __vfs_write
>> > >             new_sync_write
>> > >               pipe_write
>> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
>> > > read the pipe
>> > >
>> > > Thread #4: systemd-coredump
>> > > ksys_read
>> > >   vfs_read
>> > >     __vfs_read
>> > >       seq_read
>> > >         proc_single_show
>> > >           proc_cgroup_show
>> > >             cgroup_mutex -> waits from Thread #2 for this lock.
>> >
>> > >
>> > > In Summary:
>> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
>> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
>> > > waits for Thread#1 for mem_hotplug_lock rwlock.
>> > >
>> > > This work appears to fix this deadlock because cgroup_mutex is not
>> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
>> > > removes memcg_deactivate_kmem_caches.
>> > >
>> > > Thank you,
>> > > Pasha
>> > >
>> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>> > > >
>> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
>> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
>> > > > > > The existing cgroup slab memory controller is based on the idea of
>> > > > > > replicating slab allocator internals for each memory cgroup.
>> > > > > > This approach promises a low memory overhead (one pointer per page),
>> > > > > > and isn't adding too much code on hot allocation and release paths.
>> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
>> > > > > >
>> > > > > > Using a drgn* script I've got an estimation of slab utilization on
>> > > > > > a number of machines running different production workloads. In most
>> > > > > > cases it was between 45% and 65%, and the best number I've seen was
>> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
>> > > > > > it brings back 30-50% of slab memory. It means that the real price
>> > > > > > of the existing slab memory controller is way bigger than a pointer
>> > > > > > per page.
>> > > > > >
>> > > > > > The real reason why the existing design leads to a low slab utilization
>> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
>> > > > > > If there are only few allocations of certain size made by a cgroup,
>> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
>> > > > > > deleted, or the cgroup contains a single-threaded application which is
>> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
>> > > > > > in all these cases the resulting slab utilization is very low.
>> > > > > > If kmem accounting is off, the kernel is able to use free space
>> > > > > > on slab pages for other allocations.
>> > > > > >
>> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
>> > > > > > introduced and was an opt-in feature, which had to be turned on
>> > > > > > individually for each memory cgroup. But now it's turned on by default
>> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
>> > > > > > create a large number of cgroups.
>> > > > > >
>> > > > > > This patchset provides a new implementation of the slab memory controller,
>> > > > > > which aims to reach a much better slab utilization by sharing slab pages
>> > > > > > between multiple memory cgroups. Below is the short description of the new
>> > > > > > design (more details in commit messages).
>> > > > > >
>> > > > > > Accounting is performed per-object instead of per-page. Slab-related
>> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
>> > > > > > with rounding up and remembering leftovers.
>> > > > > >
>> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
>> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
>> > > > > > working, instead of saving a pointer to the memory cgroup directly an
>> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
>> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
>> > > > > > allows to reparent all allocated objects without walking them over and
>> > > > > > changing memcg pointer to the parent.
>> > > > > >
>> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
>> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
>> > > > > > allocations and the second set for all other allocations. This allows to
>> > > > > > simplify the lifetime management of individual kmem_caches: they are
>> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
>> > > > > > and make things generally simpler.
>> > > > > >
>> > > > > > The patchset* has been tested on a number of different workloads in our
>> > > > > > production. In all cases it saved significant amount of memory, measured
>> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
>> > > > > > of slab memory has been reduced by 35-45%.
>> > > > >
>> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
>> > > > > with this patchset on a 10 core POWER8 host:
>> > > > >
>> > > > > ==========================================================================
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
>> > > > > of a mem cgroup (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
>> > > > > memory.usage_in_bytes         337510400       335806464       .5
>> > > > > Slab: (kB)                    814336          607296          25
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
>> > > > > memory.usage_in_bytes         318832640       300154880       5
>> > > > > Slab: (kB)                    789888          559744          29
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > >
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
>> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
>> > > > > memory.usage_in_bytes         7368015872      6275923968      15
>> > > > > Slab: (kB)                    1139072         785408          31
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
>> > > > > memory.usage_in_bytes         6540427264      6072893440      7
>> > > > > Slab: (kB)                    1074304         761280          29
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
>> > > > > memory.usage_in_bytes         6406209536      6177357824      3
>> > > > > Slab: (kB)                    1244288         739712          40
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > > Slab consumption right after boot
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > Slab: (kB)                    821888          583424          29
>> > > > > ==========================================================================
>> > > > >
>> > > > > Summary:
>> > > > >
>> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
>> > > > > around 70% and 30% reduction consistently.
>> > > > >
>> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
>> > > > > kernel compilation.
>> > > > >
>> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
>> > > > > same is seen right after boot too.
>> > > >
>> > > > That's just perfect!
>> > > >
>> > > > memory.usage_in_bytes was most likely the same because the freed space
>> > > > was taken by pagecache.
>> > > >
>> > > > Thank you very much for testing!
>> > > >
>> > > > Roman
> 



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
@ 2020-09-02 10:39               ` David Hildenbrand
  2020-09-02 12:42                 ` Pavel Tatashin
  2020-09-02 11:26               ` Michal Hocko
  2020-09-02 11:32               ` Michal Hocko
  2 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2020-09-02 10:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Michal Hocko, Johannes Weiner, Shakeel Butt,
	Vladimir Davydov, linux-kernel, Kernel Team, Yafang Shao, stable,
	Linus Torvalds, Sasha Levin, Greg Kroah-Hartman,
	David Hildenbrand



> On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
>> There appears to be another problem that is related to the
>> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>> 
>> In the original deadlock that I described, the workaround is to
>> replace crash dump from piping to Linux traditional save to files
>> method. However, after trying this workaround, I still observed
>> hardware watchdog resets during machine  shutdown.
>> 
>> The new problem occurs for the following reason: upon shutdown systemd
>> calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

If all it's doing is offlining random memory, that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
@ 2020-09-02 11:26               ` Michal Hocko
  2020-09-02 12:51                 ` Pavel Tatashin
  2020-09-02 11:32               ` Michal Hocko
  2 siblings, 1 reply; 84+ messages in thread
From: Michal Hocko @ 2020-09-02 11:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > 
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> > 
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

Yes, this sounds quite dubious.

> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> > 
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);

This shouldn't really happen. What prevents this from proceeding?
Did you manage to catch the specific pfn and see what it is used for?
start_isolate_page_range and scan_movable_pages should fail if there is
any memory that cannot be migrated permanently. This is something that
we should focus on when debugging.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
  2020-09-02 11:26               ` Michal Hocko
@ 2020-09-02 11:32               ` Michal Hocko
  2020-09-02 12:53                 ` Pavel Tatashin
  2 siblings, 1 reply; 84+ messages in thread
From: Michal Hocko @ 2020-09-02 11:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> >> > > Thread #2: css killer kthread
> >> > >    css_killed_work_fn
> >> > >      cgroup_mutex  <- Grab this Mutex
> >> > >      mem_cgroup_css_offline
> >> > >        memcg_offline_kmem.part
> >> > >           memcg_deactivate_kmem_caches
> >> > >             get_online_mems
> >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access

And one more thing. This has been brought up several times already.
Maybe I have forgotten, but why do we take hotplug locks in this path in
the first place? The memory hotplug notifier takes slab_mutex, so this
shouldn't really be needed.
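
To make the reported cycle concrete, here is a minimal illustrative
sketch of the two paths (kernel-style sketch, not actual kernel code;
the wrapper function names are hypothetical, only the locks and the
get_online_mems()/percpu_down_write() calls mirror the traces above):

/*
 * Illustrative sketch of the reported cycle, not actual kernel code.
 * Thread #1 holds the write side of mem_hotplug_lock while it loops in
 * the offline path; Thread #2 holds cgroup_mutex and then blocks on the
 * read side, and per the earlier reports systemd is then unable to stop
 * the hot-remove service, so the machine eventually hits the watchdog.
 */
static void hot_remove_service(void)		/* Thread #1 */
{
	percpu_down_write(&mem_hotplug_lock);	/* mem_hotplug_begin() */
	/* ... __offline_pages() retry loop that never terminates ... */
	percpu_up_write(&mem_hotplug_lock);
}

static void css_offline_worker(void)		/* Thread #2 */
{
	mutex_lock(&cgroup_mutex);		/* held across css offlining */
	get_online_mems();			/* read side: blocked behind Thread #1 */
	/* ... memcg_deactivate_kmem_caches() would run here under slab_mutex ... */
	put_online_mems();
	mutex_unlock(&cgroup_mutex);
}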
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  6:23                 ` Bharata B Rao
@ 2020-09-02 12:34                   ` Pavel Tatashin
  0 siblings, 0 replies; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:34 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

> I am on an old codebase that already has the fix that you are proposing,
> so I might be seeing some other issue, which I will debug further.
>
> So looks like the loop in __offline_pages() had a call to
> drain_all_pages() before it was removed by
>
> c52e75935f8d: mm: remove extra drain pages on pcp list

I see, thanks. There is a reason to have the second drain; my fix is a
little better because the drain is performed only on the rare occasion
when it is needed, but I should add a Fixes: tag. I have not checked
the alloc_contig_range race.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 10:39               ` David Hildenbrand
@ 2020-09-02 12:42                 ` Pavel Tatashin
  2020-09-02 13:50                   ` Michal Hocko
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Michal Hocko, Johannes Weiner, Shakeel Butt,
	Vladimir Davydov, linux-kernel, Kernel Team, Yafang Shao, stable,
	Linus Torvalds, Sasha Levin, Greg Kroah-Hartman,
	David Hildenbrand

> > On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> >> There appears to be another problem that is related to the
> >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >>
> >> In the original deadlock that I described, the workaround is to
> >> replace crash dump from piping to Linux traditional save to files
> >> method. However, after trying this workaround, I still observed
> >> hardware watchdog resets during machine shutdown.
> >>
> >> The new problem occurs for the following reason: upon shutdown systemd
> >> calls a service that hot-removes memory, and if hot-removing fails for
> >
> > Why is that hotremove even needed if we're shutting down? Are there any
> > (virtualization?) platforms where it makes some difference over plain
> > shutdown/restart?
>
> If all it's doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)

Hi David,

This is how we are using it at Microsoft: there is a very large
number of small-memory machines (8G each) with low downtime
requirements (a reboot must take under a second). There is also a
large state (~2G of memory) that we need to transfer during reboot,
because otherwise it is very expensive to recreate the state. We have
2G of system memory reserved as a pmem region in the device tree, and
we use it to pass information across reboots. Once the information is
no longer needed we hot-add that memory and use it during runtime;
before shutdown we hot-remove the 2G, save the program state on it,
and do the reboot.

Pasha


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 11:26               ` Michal Hocko
@ 2020-09-02 12:51                 ` Pavel Tatashin
  2020-09-02 13:51                   ` Michal Hocko
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
>

Hi Michal,

> This shouldn't really happen. What prevents this from proceeding?
> Did you manage to catch the specific pfn and what is it used for?

I did.

> start_isolate_page_range and scan_movable_pages should fail if there is
> any memory that cannot be migrated permanently. This is something that
> we should focus on when debugging.

I was hitting this issue:
mm/memory_hotplug: drain per-cpu pages again during memory offline
https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

Once the pcp drain race is fixed, this particular deadlock becomes irrelevant.
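
For context, a rough sketch of the idea behind that fix, modelled on
the __offline_pages() loop quoted earlier in this thread (illustrative
only; see the lore link above for the actual patch):

	do {
		pfn = scan_movable_pages(pfn, end_pfn);
		if (pfn)
			do_migrate_range(pfn, end_pfn);
		/* ... */
		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
					    NULL, check_pages_isolated_cb);
		if (ret)
			/*
			 * Nothing is movable, yet some PFNs are still busy:
			 * they may be sitting on per-cpu (pcp) free lists,
			 * so flush those before checking again instead of
			 * spinning forever.
			 */
			drain_all_pages(zone);
	} while (ret);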

The lock ordering, however (cgroup_mutex -> mem_hotplug_lock), is
still bad, and the first race condition that I was hitting and
described above is still present. For now I have added a temporary
workaround: saving the core to a file instead of piping it during
shutdown. I am glad mainline is fixed, but the stable trees should
also get some kind of fix for this problem.

Pasha


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 11:32               ` Michal Hocko
@ 2020-09-02 12:53                 ` Pavel Tatashin
  2020-09-02 13:52                   ` Michal Hocko
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > >> > > Thread #2: css killer kthread
> > >> > >    css_killed_work_fn
> > >> > >      cgroup_mutex  <- Grab this Mutex
> > >> > >      mem_cgroup_css_offline
> > >> > >        memcg_offline_kmem.part
> > >> > >           memcg_deactivate_kmem_caches
> > >> > >             get_online_mems
> > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>
> > And one more thing. This has been brought up several times already.
> > Maybe I have forgotten, but why do we take hotplug locks in this path in
> > the first place? The memory hotplug notifier takes slab_mutex, so this
> > shouldn't really be needed.

Good point; it seems this lock can be completely removed from
memcg_deactivate_kmem_caches().
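
A rough, purely illustrative sketch of that direction (the real
memcg_deactivate_kmem_caches() body differs and the "_sketch" name is
hypothetical; the point is only that slab_mutex, which the memory
hotplug notifier also takes, may already provide the serialization, so
the get_online_mems()/put_online_mems() pair could go away):

/*
 * Editorial sketch, not the actual kernel function: only the locking
 * skeleton is shown and the kmem_caches walk is elided.
 */
void memcg_deactivate_kmem_caches_sketch(struct mem_cgroup *memcg,
					 struct mem_cgroup *parent)
{
	get_online_cpus();
	/*
	 * No get_online_mems() here: if the slab memory hotplug notifier
	 * runs under slab_mutex, taking slab_mutex below is already
	 * enough to serialize against memory hotplug.
	 */
	mutex_lock(&slab_mutex);
	/* ... walk and deactivate this memcg's kmem_caches ... */
	mutex_unlock(&slab_mutex);
	put_online_cpus();
}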

Pasha

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:42                 ` Pavel Tatashin
@ 2020-09-02 13:50                   ` Michal Hocko
  2020-09-02 14:20                     ` Pavel Tatashin
  0 siblings, 1 reply; 84+ messages in thread
From: Michal Hocko @ 2020-09-02 13:50 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:42:13, Pavel Tatashin wrote:
> > > On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > >> There appears to be another problem that is related to the
> > >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >>
> > >> In the original deadlock that I described, the workaround is to
> > >> replace crash dump from piping to Linux traditional save to files
> > >> method. However, after trying this workaround, I still observed
> > >> hardware watchdog resets during machine shutdown.
> > >>
> > >> The new problem occurs for the following reason: upon shutdown systemd
> > >> calls a service that hot-removes memory, and if hot-removing fails for
> > >
> > > Why is that hotremove even needed if we're shutting down? Are there any
> > > (virtualization?) platforms where it makes some difference over plain
> > > shutdown/restart?
> >
> > If all it's doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)
> 
> Hi David,
> 
> This is how we are using it at Microsoft: there is a very large
> number of small-memory machines (8G each) with low downtime
> requirements (a reboot must take under a second). There is also a
> large state (~2G of memory) that we need to transfer during reboot,
> because otherwise it is very expensive to recreate the state. We have
> 2G of system memory reserved as a pmem region in the device tree, and
> we use it to pass information across reboots. Once the information is
> no longer needed we hot-add that memory and use it during runtime;
> before shutdown we hot-remove the 2G, save the program state on it,
> and do the reboot.

I still do not get it. So what guarantees that the memory is
offlineable in the first place? Also, what is the difference between
offlining and simply shutting the system down so that the memory is not
used afterwards? In other words, what kind of difference does
hotremove make?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:51                 ` Pavel Tatashin
@ 2020-09-02 13:51                   ` Michal Hocko
  0 siblings, 0 replies; 84+ messages in thread
From: Michal Hocko @ 2020-09-02 13:51 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:51:06, Pavel Tatashin wrote:
> > > > Thread #1: memory hot-remove systemd service
> > > > Loops indefinitely, because if there is something still to be migrated
> > > > this loop never terminates. However, this loop can be terminated via
> > > > signal from systemd after timeout.
> > > > __offline_pages()
> > > >       do {
> > > >           pfn = scan_movable_pages(pfn, end_pfn);
> > > >                   # Returns 0, meaning there is nothing available to
> > > >                   # migrate, no page is PageLRU(page)
> > > >           ...
> > > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > >                                             NULL, check_pages_isolated_cb);
> > > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > > >                   # still has to be migrated.
> > > >       } while (ret);
> >
> 
> Hi Michal,
> 
> > This shouldn't really happen. What prevents this from proceeding?
> > Did you manage to catch the specific pfn and what is it used for?
> 
> I did.
> 
> > start_isolate_page_range and scan_movable_pages should fail if there is
> > any memory that cannot be migrated permanently. This is something that
> > we should focus on when debugging.
> 
> I was hitting this issue:
> mm/memory_hotplug: drain per-cpu pages again during memory offline
> https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

I have noticed the patch but didn't have time to think it through (I
have been off for a few days and am catching up with emails). I will
give it a higher priority.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:53                 ` Pavel Tatashin
@ 2020-09-02 13:52                   ` Michal Hocko
  0 siblings, 0 replies; 84+ messages in thread
From: Michal Hocko @ 2020-09-02 13:52 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:53:49, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > > >> > > Thread #2: css killer kthread
> > > >> > >    css_killed_work_fn
> > > >> > >      cgroup_mutex  <- Grab this Mutex
> > > >> > >      mem_cgroup_css_offline
> > > >> > >        memcg_offline_kmem.part
> > > >> > >           memcg_deactivate_kmem_caches
> > > >> > >             get_online_mems
> > > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > And one more thing. This has been brought up several times already.
> > Maybe I have forgotten, but why do we take hotplug locks in this path in
> > the first place? The memory hotplug notifier takes slab_mutex, so this
> > shouldn't really be needed.
> 
> Good point; it seems this lock can be completely removed from
> memcg_deactivate_kmem_caches().

I am pretty sure we have discussed that in the past, but I do not
remember the outcome. Either we concluded that this is indeed the case
but nobody came up with a patch, or we hit some obscure issue... Maybe
David/Roman remember more than I do.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 13:50                   ` Michal Hocko
@ 2020-09-02 14:20                     ` Pavel Tatashin
  2020-09-03 18:09                       ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Pavel Tatashin @ 2020-09-02 14:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand

> > This is how we are using it at Microsoft: there is a very large
> > number of small-memory machines (8G each) with low downtime
> > requirements (a reboot must take under a second). There is also a
> > large state (~2G of memory) that we need to transfer during reboot,
> > because otherwise it is very expensive to recreate the state. We have
> > 2G of system memory reserved as a pmem region in the device tree, and
> > we use it to pass information across reboots. Once the information is
> > no longer needed we hot-add that memory and use it during runtime;
> > before shutdown we hot-remove the 2G, save the program state on it,
> > and do the reboot.
>
> I still do not get it. So what guarantees that the memory is
> offlineable in the first place?

It is in a movable zone, and we have more than 2G of free memory for
successful migrations.
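
For reference, a minimal sketch of the step that keeps such hot-added
memory offlineable: onlining the new block into ZONE_MOVABLE via sysfs.
The block number "memory42" is a placeholder, and real deployments
typically drive this through udev rules or management tooling rather
than a one-off program like this.

/*
 * Editorial sketch: online a hot-added memory block into ZONE_MOVABLE
 * so it can be migrated away and offlined again later.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory42/state";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "online_movable" asks the kernel to place the block in ZONE_MOVABLE */
	if (fprintf(f, "online_movable\n") < 0)
		perror("fprintf");
	fclose(f);
	return 0;
}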

> Also, what is the difference between
> offlining and simply shutting the system down so that the memory is not
> used afterwards? In other words, what kind of difference does
> hotremove make?

For performance reasons we do not erase memory content during system
updates/reboots. The memory content is erased only on a power cycle,
which we do not do in production.

Once we hot-remove the memory, we convert it back into a PMEM DAX
device, format it as ext4, mount it as a DAX filesystem, and allow
programs to serialize their state to it so they can read it back
after the reboot.

During startup we mount the pmem, programs read their state back, and
after that we hotplug the PMEM DAX device as a movable zone. This way,
during normal runtime, we have the full 8G available to programs.
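
As an illustration of the shutdown-side step, a minimal user-space
sketch of mounting the reclaimed pmem namespace as a DAX-enabled ext4
filesystem; the device name /dev/pmem0, the mount point /mnt/state and
the mount flags are placeholders, and the real workflow is of course
driven by site-specific tooling.

/*
 * Editorial sketch: mount the pmem device with the ext4 "dax" option so
 * programs can serialize their state before the reboot.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("/dev/pmem0", "/mnt/state", "ext4",
		  MS_NOATIME, "dax") != 0) {
		perror("mount");
		return 1;
	}
	/* programs now write their state under /mnt/state ... */
	return 0;
}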

Pasha


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 14:20                     ` Pavel Tatashin
@ 2020-09-03 18:09                       ` David Hildenbrand
  0 siblings, 0 replies; 84+ messages in thread
From: David Hildenbrand @ 2020-09-03 18:09 UTC (permalink / raw)
  To: Pavel Tatashin, Michal Hocko
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

> For performance reasons we do not erase memory content during system
> updates/reboots. The memory content is erased only on a power cycle,
> which we do not do in production.
> 
> Once we hot-remove the memory, we convert it back into a PMEM DAX
> device, format it as ext4, mount it as a DAX filesystem, and allow
> programs to serialize their state to it so they can read it back
> after the reboot.
> 
> During startup we mount the pmem, programs read their state back, and
> after that we hotplug the PMEM DAX device as a movable zone. This way,
> during normal runtime, we have the full 8G available to programs.
> 

Thanks for sharing the workflow - while it sounds somewhat sub-optimal,
I guess it gets the job done using existing tools/mechanisms.

(I remember the persistent tmpfs over kexec RFC, which tries to tackle
it by introducing something new)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2020-09-03 18:09 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-27 17:34 [PATCH v2 00/28] The new cgroup slab memory controller Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 02/28] mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 03/28] mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 04/28] mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 05/28] mm: memcg/slab: cache page number in memcg_(un)charge_slab() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 06/28] mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj() Roman Gushchin
2020-02-03 16:05   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations Roman Gushchin
2020-02-03 16:12   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state() Roman Gushchin
2020-02-03 16:13   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state() Roman Gushchin
2020-02-03 17:39   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2020-02-03 17:44   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
2020-02-03 17:58   ` Johannes Weiner
2020-02-03 18:25     ` Roman Gushchin
2020-02-03 20:34       ` Johannes Weiner
2020-02-03 22:28         ` Roman Gushchin
2020-02-03 22:39           ` Johannes Weiner
2020-02-04  1:44             ` Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 13/28] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API Roman Gushchin
2020-02-03 19:31   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
2020-02-03 18:27   ` Johannes Weiner
2020-02-03 18:34     ` Roman Gushchin
2020-02-03 20:46       ` Johannes Weiner
2020-02-03 21:19         ` Roman Gushchin
2020-02-03 22:29           ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
2020-02-03 19:53   ` Johannes Weiner
2020-01-27 17:34 ` [PATCH v2 18/28] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 19/28] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 20/28] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Roman Gushchin
2020-02-03 19:50   ` Johannes Weiner
2020-02-03 20:58     ` Roman Gushchin
2020-02-03 22:17       ` Johannes Weiner
2020-02-03 22:38         ` Roman Gushchin
2020-02-04  1:15         ` Roman Gushchin
2020-02-04  2:47           ` Johannes Weiner
2020-02-04  4:35             ` Roman Gushchin
2020-02-04 18:41               ` Johannes Weiner
2020-02-05 15:58                 ` Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 22/28] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 23/28] mm: memcg/slab: deprecate memcg_kmem_get_cache() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 24/28] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 25/28] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Roman Gushchin
2020-01-30  2:17   ` Bharata B Rao
2020-01-30  2:44     ` Roman Gushchin
2020-01-31 22:24     ` Roman Gushchin
2020-02-12  5:21       ` Bharata B Rao
2020-02-12 20:42         ` Roman Gushchin
2020-01-27 17:34 ` [PATCH v2 28/28] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
2020-01-30  2:41   ` Roman Gushchin
2020-08-12 23:16     ` Pavel Tatashin
2020-08-12 23:18       ` Pavel Tatashin
2020-08-13  0:04       ` Roman Gushchin
2020-08-13  0:31         ` Pavel Tatashin
2020-08-28 16:47           ` Pavel Tatashin
2020-09-01  5:28             ` Bharata B Rao
2020-09-01 12:52               ` Pavel Tatashin
2020-09-02  6:23                 ` Bharata B Rao
2020-09-02 12:34                   ` Pavel Tatashin
2020-09-02  9:53             ` Vlastimil Babka
2020-09-02 10:39               ` David Hildenbrand
2020-09-02 12:42                 ` Pavel Tatashin
2020-09-02 13:50                   ` Michal Hocko
2020-09-02 14:20                     ` Pavel Tatashin
2020-09-03 18:09                       ` David Hildenbrand
2020-09-02 11:26               ` Michal Hocko
2020-09-02 12:51                 ` Pavel Tatashin
2020-09-02 13:51                   ` Michal Hocko
2020-09-02 11:32               ` Michal Hocko
2020-09-02 12:53                 ` Pavel Tatashin
2020-09-02 13:52                   ` Michal Hocko
