linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting
       [not found] <20200127173453.2089565-1-guro@fb.com>
@ 2020-01-27 17:34 ` Roman Gushchin
  2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-01-27 17:34 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, kernel-team, Bharata B Rao, Yafang Shao

From: Johannes Weiner <hannes@cmpxchg.org>

The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's
new slab controller, which maintains pools of objects and doesn't want
to keep an extra balance sheet for the pages backing those objects.

This unusual refcounting design (reference counts usually track
pointers to an object) is only for historical reasons: memcg used to
not take any css references and simply stalled offlining until all
charges had been reparented and the page counters had dropped to
zero. When we got rid of the reparenting requirement, the simple
mechanical translation was to take a reference for every charge.

More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"),
commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
reclaim from offlined groups").

The new slab controller exposes the limitations in this scheme, so
let's switch it to a more idiomatic reference counting model based on
actual kernel pointers to the memcg:

- The per-cpu stock holds a reference to the memcg its caching

- User pages hold a reference for their page->mem_cgroup. Transparent
  huge pages will no longer acquire tail references in advance, we'll
  get them if needed during the split.

- Kernel pages hold a reference for their page->mem_cgroup

- mem_cgroup_try_charge(), if successful, will return one reference to
  be consumed by page->mem_cgroup during commit, or put during cancel

- Pages allocated in the root cgroup will acquire and release css
  references for simplicity. css_get() and css_put() optimize that.

- The current memcg_charge_slab() already hacked around the per-charge
  references; this change gets rid of that as well.

Roman: I've reformatted commit references in the commit log to make
  checkpatch.pl happy.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 45 ++++++++++++++++++++++++++-------------------
 mm/slab.h       |  2 --
 2 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf846fb60d9f..b86cfdcf2e1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2109,13 +2109,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 {
 	struct mem_cgroup *old = stock->cached;
 
+	if (!old)
+		return;
+
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
-		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
+
+	css_put(&old->css);
 	stock->cached = NULL;
 }
 
@@ -2151,6 +2155,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
+		css_get(&memcg->css);
 		stock->cached = memcg;
 	}
 	stock->nr_pages += nr_pages;
@@ -2554,12 +2559,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
 
 done_restock:
-	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2596,8 +2599,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 
 static void lock_page_lru(struct page *page, int *isolated)
@@ -2948,6 +2949,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+			return 0;
 		}
 	}
 	css_put(&memcg->css);
@@ -2970,12 +2972,11 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
+	css_put(&memcg->css);
 
 	/* slab pages do not have PageKmemcg flag set */
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
-
-	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -2987,15 +2988,18 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = head->mem_cgroup;
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		css_get(&memcg->css);
+		head[i].mem_cgroup = memcg;
+	}
 
-	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
+	__mod_memcg_state(memcg, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -5401,7 +5405,9 @@ static int mem_cgroup_move_account(struct page *page,
 	 * uncharging, charging, migration, or LRU putback.
 	 */
 
-	/* caller should have done css_get */
+	css_get(&to->css);
+	css_put(&from->css);
+
 	page->mem_cgroup = to;
 
 	spin_unlock_irqrestore(&from->move_lock, flags);
@@ -6420,8 +6426,10 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		memcg = get_mem_cgroup_from_mm(mm);
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
-
-	css_put(&memcg->css);
+	if (ret) {
+		css_put(&memcg->css);
+		memcg = NULL;
+	}
 out:
 	*memcgp = memcg;
 	return ret;
@@ -6517,6 +6525,8 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		return;
 
 	cancel_charge(memcg, nr_pages);
+
+	css_put(&memcg->css);
 }
 
 struct uncharge_gather {
@@ -6558,9 +6568,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
-
-	if (!mem_cgroup_is_root(ug->memcg))
-		css_put_many(&ug->memcg->css, nr_pages);
 }
 
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
@@ -6608,6 +6615,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->dummy_page = page;
 	page->mem_cgroup = NULL;
+	css_put(&ug->memcg->css);
 }
 
 static void uncharge_list(struct list_head *page_list)
@@ -6714,8 +6722,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
-	css_get_many(&memcg->css, nr_pages);
 
+	css_get(&memcg->css);
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
@@ -6964,8 +6972,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 				     -nr_entries);
 	memcg_check_events(memcg, page);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_entries);
+	css_put(&memcg->css);
 }
 
 /**
diff --git a/mm/slab.h b/mm/slab.h
index 517f1f1359e5..7925f7005161 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -373,9 +373,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages << PAGE_SHIFT);
 
-	/* transer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-	css_put_many(&memcg->css, 1 << order);
 out:
 	css_put(&memcg->css);
 	return ret;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
       [not found] <20200127173453.2089565-1-guro@fb.com>
  2020-01-27 17:34 ` [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
@ 2020-01-30  2:06 ` Bharata B Rao
  2020-01-30  2:41   ` Roman Gushchin
       [not found] ` <20200127173453.2089565-28-guro@fb.com>
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 56+ messages in thread
From: Bharata B Rao @ 2020-01-30  2:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, kernel-team,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> The existing cgroup slab memory controller is based on the idea of
> replicating slab allocator internals for each memory cgroup.
> This approach promises a low memory overhead (one pointer per page),
> and isn't adding too much code on hot allocation and release paths.
> But is has a very serious flaw: it leads to a low slab utilization.
> 
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
> 
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.
> 
> Arguably it wasn't an issue back to days when the kmem controller was
> introduced and was an opt-in feature, which had to be turned on
> individually for each memory cgroup. But now it's turned on by default
> on both cgroup v1 and v2. And modern systemd-based systems tend to
> create a large number of cgroups.
> 
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and
> changing memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are
> destroyed with root counterparts. It allows to remove a good amount of code
> and make things generally simpler.
> 
> The patchset* has been tested on a number of different workloads in our
> production. In all cases it saved significant amount of memory, measured
> from high hundreds of MBs to single GBs per host. On average, the size
> of slab memory has been reduced by 35-45%.

Here are some numbers from multiple runs of sysbench and kernel compilation
with this patchset on a 10 core POWER8 host:

==========================================================================
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
of a mem cgroup (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	15859712	4456448		72
memory.usage_in_bytes		337510400	335806464	.5
Slab: (kB)			814336		607296		25

memory.kmem.usage_in_bytes	16187392	4653056		71
memory.usage_in_bytes		318832640	300154880	5
Slab: (kB)			789888		559744		29
--------------------------------------------------------------------------


Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for kernel compilation (make -s -j64) Compilation was
done from bash that is in a memory cgroup. (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	338493440	231931904	31
memory.usage_in_bytes		7368015872	6275923968	15
Slab: (kB)			1139072		785408		31

memory.kmem.usage_in_bytes	341835776	236453888	30
memory.usage_in_bytes		6540427264	6072893440	7
Slab: (kB)			1074304		761280		29

memory.kmem.usage_in_bytes	340525056	233570304	31
memory.usage_in_bytes		6406209536	6177357824	3
Slab: (kB)			1244288		739712		40
--------------------------------------------------------------------------

Slab consumption right after boot
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
Slab: (kB)			821888		583424		29
==========================================================================

Summary:

With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
around 70% and 30% reduction consistently.

Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
kernel compilation.

Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
same is seen right after boot too.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
       [not found] ` <20200127173453.2089565-28-guro@fb.com>
@ 2020-01-30  2:17   ` Bharata B Rao
  2020-01-30  2:44     ` Roman Gushchin
  2020-01-31 22:24     ` Roman Gushchin
  0 siblings, 2 replies; 56+ messages in thread
From: Bharata B Rao @ 2020-01-30  2:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, kernel-team,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> Make slabinfo.py compatible with the new slab controller.
 
Tried using slabinfo.py, but run into some errors. (I am using your
new_slab.2 branch)

 ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
    runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./tools/cgroup/slabinfo.py", line 220, in <module>
    main()
  File "./tools/cgroup/slabinfo.py", line 165, in main
    find_memcg_ids()
  File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
    MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
AttributeError: '_drgn.Object' object has no attribute 'ino'

I did make this change...

# git diff
diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
index b779a4863beb..571fd95224d6 100755
--- a/tools/cgroup/slabinfo.py
+++ b/tools/cgroup/slabinfo.py
@@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
                                        'sibling'):
             name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
             memcg = container_of(css, 'struct mem_cgroup', 'css')
-            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
+            MEMCGS[css.cgroup.kn.id.value_()] = memcg
             find_memcg_ids(css, name)


but now get empty output.

# ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>

Guess this script is not yet ready for the upstream kernel?

Regards,
Bharata.


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
@ 2020-01-30  2:41   ` Roman Gushchin
  2020-08-12 23:16     ` Pavel Tatashin
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-01-30  2:41 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > The existing cgroup slab memory controller is based on the idea of
> > replicating slab allocator internals for each memory cgroup.
> > This approach promises a low memory overhead (one pointer per page),
> > and isn't adding too much code on hot allocation and release paths.
> > But is has a very serious flaw: it leads to a low slab utilization.
> > 
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> > 
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> > 
> > Arguably it wasn't an issue back to days when the kmem controller was
> > introduced and was an opt-in feature, which had to be turned on
> > individually for each memory cgroup. But now it's turned on by default
> > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > create a large number of cgroups.
> > 
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> > 
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> > 
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and
> > changing memcg pointer to the parent.
> > 
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are
> > destroyed with root counterparts. It allows to remove a good amount of code
> > and make things generally simpler.
> > 
> > The patchset* has been tested on a number of different workloads in our
> > production. In all cases it saved significant amount of memory, measured
> > from high hundreds of MBs to single GBs per host. On average, the size
> > of slab memory has been reduced by 35-45%.
> 
> Here are some numbers from multiple runs of sysbench and kernel compilation
> with this patchset on a 10 core POWER8 host:
> 
> ==========================================================================
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> of a mem cgroup (Sampling every 5s)
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes	15859712	4456448		72
> memory.usage_in_bytes		337510400	335806464	.5
> Slab: (kB)			814336		607296		25
> 
> memory.kmem.usage_in_bytes	16187392	4653056		71
> memory.usage_in_bytes		318832640	300154880	5
> Slab: (kB)			789888		559744		29
> --------------------------------------------------------------------------
> 
> 
> Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> done from bash that is in a memory cgroup. (Sampling every 5s)
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> memory.kmem.usage_in_bytes	338493440	231931904	31
> memory.usage_in_bytes		7368015872	6275923968	15
> Slab: (kB)			1139072		785408		31
> 
> memory.kmem.usage_in_bytes	341835776	236453888	30
> memory.usage_in_bytes		6540427264	6072893440	7
> Slab: (kB)			1074304		761280		29
> 
> memory.kmem.usage_in_bytes	340525056	233570304	31
> memory.usage_in_bytes		6406209536	6177357824	3
> Slab: (kB)			1244288		739712		40
> --------------------------------------------------------------------------
> 
> Slab consumption right after boot
> --------------------------------------------------------------------------
> 				5.5.0-rc7-mm1	+slab patch	%reduction
> --------------------------------------------------------------------------
> Slab: (kB)			821888		583424		29
> ==========================================================================
> 
> Summary:
> 
> With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> around 70% and 30% reduction consistently.
> 
> Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> kernel compilation.
> 
> Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> same is seen right after boot too.

That's just perfect!

memory.usage_in_bytes was most likely the same because the freed space
was taken by pagecache.

Thank you very much for testing!

Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-30  2:17   ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Bharata B Rao
@ 2020-01-30  2:44     ` Roman Gushchin
  2020-01-31 22:24     ` Roman Gushchin
  1 sibling, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-01-30  2:44 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > Make slabinfo.py compatible with the new slab controller.
>  
> Tried using slabinfo.py, but run into some errors. (I am using your
> new_slab.2 branch)
> 
>  ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo.py", line 220, in <module>
>     main()
>   File "./tools/cgroup/slabinfo.py", line 165, in main
>     find_memcg_ids()
>   File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
>     MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> AttributeError: '_drgn.Object' object has no attribute 'ino'
> 
> I did make this change...
> 
> # git diff
> diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
> index b779a4863beb..571fd95224d6 100755
> --- a/tools/cgroup/slabinfo.py
> +++ b/tools/cgroup/slabinfo.py
> @@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
>                                         'sibling'):
>              name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
>              memcg = container_of(css, 'struct mem_cgroup', 'css')
> -            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> +            MEMCGS[css.cgroup.kn.id.value_()] = memcg
>              find_memcg_ids(css, name)
> 
> 
> but now get empty output.
> 
> # ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> 
> Guess this script is not yet ready for the upstream kernel?

Yes, looks like I've used a slightly outdated kernel version to test it.
I'll fix it in the next version.

Thank you for reporting it!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-30  2:17   ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Bharata B Rao
  2020-01-30  2:44     ` Roman Gushchin
@ 2020-01-31 22:24     ` Roman Gushchin
  2020-02-12  5:21       ` Bharata B Rao
  1 sibling, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-01-31 22:24 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > Make slabinfo.py compatible with the new slab controller.
>  
> Tried using slabinfo.py, but run into some errors. (I am using your
> new_slab.2 branch)
> 
>  ./tools/cgroup/slabinfo.py /sys/fs/cgroup/memory/1
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo.py", line 220, in <module>
>     main()
>   File "./tools/cgroup/slabinfo.py", line 165, in main
>     find_memcg_ids()
>   File "./tools/cgroup/slabinfo.py", line 43, in find_memcg_ids
>     MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> AttributeError: '_drgn.Object' object has no attribute 'ino'
> 
> I did make this change...
> 
> # git diff
> diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
> index b779a4863beb..571fd95224d6 100755
> --- a/tools/cgroup/slabinfo.py
> +++ b/tools/cgroup/slabinfo.py
> @@ -40,7 +40,7 @@ def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
>                                         'sibling'):
>              name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
>              memcg = container_of(css, 'struct mem_cgroup', 'css')
> -            MEMCGS[css.cgroup.kn.id.ino.value_()] = memcg
> +            MEMCGS[css.cgroup.kn.id.value_()] = memcg
>              find_memcg_ids(css, name)
> 
> 
> but now get empty output.

Btw, I've checked that the change like you've done above fixes the problem.
The script works for me both on current upstream and new_slab.2 branch.

Are you sure that in your case there is some kernel memory charged to that
cgroup? Please note, that in the current implementation kmem_caches are created
on demand, so the accounting is effectively enabled with some delay.

Thank you!

Below is an updated version of the patch to use:
--------------------------------------------------------------------------------

From 69b8e1bf451043c41e43e769b9ae15b36092ddf9 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 15 Oct 2019 17:06:04 -0700
Subject: [PATCH v2 26/28] tools/cgroup: add slabinfo.py tool

Add a drgn-based tool to display slab information for a given memcg.
Can replace cgroup v1 memory.kmem.slabinfo interface on cgroup v2,
but in a more flexiable way.

Currently supports only SLUB configuration, but SLAB can be trivially
added later.

Output example:
$ sudo ./tools/cgroup/slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
shmem_inode_cache     92     92    704   46    8 : tunables    0    0    0 : slabdata      2      2      0
eventpoll_pwq         56     56     72   56    1 : tunables    0    0    0 : slabdata      1      1      0
eventpoll_epi         32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-64           128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
mm_struct            160    160   1024   32    8 : tunables    0    0    0 : slabdata      5      5      0
signal_cache          96     96   1024   32    8 : tunables    0    0    0 : slabdata      3      3      0
sighand_cache         45     45   2112   15    8 : tunables    0    0    0 : slabdata      3      3      0
files_cache          138    138    704   46    8 : tunables    0    0    0 : slabdata      3      3      0
task_delay_info      153    153     80   51    1 : tunables    0    0    0 : slabdata      3      3      0
task_struct           27     27   3520    9    8 : tunables    0    0    0 : slabdata      3      3      0
radix_tree_node       56     56    584   28    4 : tunables    0    0    0 : slabdata      2      2      0
btrfs_inode          140    140   1136   28    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-1024          64     64   1024   32    8 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-192           84     84    192   42    2 : tunables    0    0    0 : slabdata      2      2      0
inode_cache           54     54    600   27    4 : tunables    0    0    0 : slabdata      2      2      0
kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
skbuff_head_cache     32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache      46     46    704   46    8 : tunables    0    0    0 : slabdata      1      1      0
cred_jar             378    378    192   42    2 : tunables    0    0    0 : slabdata      9      9      0
proc_inode_cache      96     96    672   24    4 : tunables    0    0    0 : slabdata      4      4      0
dentry               336    336    192   42    2 : tunables    0    0    0 : slabdata      8      8      0
filp                 697    864    256   32    2 : tunables    0    0    0 : slabdata     27     27      0
anon_vma             644    644     88   46    1 : tunables    0    0    0 : slabdata     14     14      0
pid                 1408   1408     64   64    1 : tunables    0    0    0 : slabdata     22     22      0
vm_area_struct      1200   1200    200   40    2 : tunables    0    0    0 : slabdata     30     30      0

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
 tools/cgroup/slabinfo.py | 158 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100755 tools/cgroup/slabinfo.py

diff --git a/tools/cgroup/slabinfo.py b/tools/cgroup/slabinfo.py
new file mode 100755
index 000000000000..0dc3a1fc260c
--- /dev/null
+++ b/tools/cgroup/slabinfo.py
@@ -0,0 +1,158 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2019 Roman Gushchin <guro@fb.com>
+# Copyright (C) 2019 Facebook
+
+from os import stat
+import argparse
+import sys
+
+from drgn.helpers.linux import list_for_each_entry, list_empty
+from drgn import container_of
+
+
+DESC = """
+This is a drgn script to provide slab statistics for memory cgroups.
+It supports cgroup v2 and v1 and can emulate memory.kmem.slabinfo
+interface of cgroup v1.
+For drgn, visit https://github.com/osandov/drgn.
+"""
+
+
+MEMCGS = {}
+
+OO_SHIFT = 16
+OO_MASK = ((1 << OO_SHIFT) - 1)
+
+
+def err(s):
+    print('slabinfo.py: error: %s' % s, file=sys.stderr, flush=True)
+    sys.exit(1)
+
+
+def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''):
+    if not list_empty(css.children.address_of_()):
+        for css in list_for_each_entry('struct cgroup_subsys_state',
+                                       css.children.address_of_(),
+                                       'sibling'):
+            name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8')
+            memcg = container_of(css, 'struct mem_cgroup', 'css')
+            MEMCGS[css.cgroup.kn.id.value_()] = memcg
+            find_memcg_ids(css, name)
+
+
+def is_root_cache(s):
+    return False if s.memcg_params.root_cache else True
+
+
+def cache_name(s):
+    if is_root_cache(s):
+        return s.name.string_().decode('utf-8')
+    else:
+        return s.memcg_params.root_cache.name.string_().decode('utf-8')
+
+
+# SLUB
+
+def oo_order(s):
+    return s.oo.x >> OO_SHIFT
+
+
+def oo_objects(s):
+    return s.oo.x & OO_MASK
+
+
+def count_partial(n, fn):
+    nr_pages = 0
+    for page in list_for_each_entry('struct page', n.partial.address_of_(),
+                                    'lru'):
+         nr_pages += fn(page)
+    return nr_pages
+
+
+def count_free(page):
+    return page.objects - page.inuse
+
+
+def slub_get_slabinfo(s, cfg):
+    nr_slabs = 0
+    nr_objs = 0
+    nr_free = 0
+
+    for node in range(cfg['nr_nodes']):
+        n = s.node[node]
+        nr_slabs += n.nr_slabs.counter.value_()
+        nr_objs += n.total_objects.counter.value_()
+        nr_free += count_partial(n, count_free)
+
+    return {'active_objs': nr_objs - nr_free,
+            'num_objs': nr_objs,
+            'active_slabs': nr_slabs,
+            'num_slabs': nr_slabs,
+            'objects_per_slab': oo_objects(s),
+            'cache_order': oo_order(s),
+            'limit': 0,
+            'batchcount': 0,
+            'shared': 0,
+            'shared_avail': 0}
+
+
+def cache_show(s, cfg):
+    if cfg['allocator'] == 'SLUB':
+        sinfo = slub_get_slabinfo(s, cfg)
+    else:
+        err('SLAB isn\'t supported yet')
+
+    print('%-17s %6lu %6lu %6u %4u %4d'
+          ' : tunables %4u %4u %4u'
+          ' : slabdata %6lu %6lu %6lu' % (
+              cache_name(s), sinfo['active_objs'], sinfo['num_objs'],
+              s.size, sinfo['objects_per_slab'], 1 << sinfo['cache_order'],
+              sinfo['limit'], sinfo['batchcount'], sinfo['shared'],
+              sinfo['active_slabs'], sinfo['num_slabs'],
+              sinfo['shared_avail']))
+
+
+def detect_kernel_config():
+    cfg = {}
+
+    cfg['nr_nodes'] = prog['nr_online_nodes'].value_()
+
+    if prog.type('struct kmem_cache').members[1][1] == 'flags':
+        cfg['allocator'] = 'SLUB'
+    elif prog.type('struct kmem_cache').members[1][1] == 'batchcount':
+        cfg['allocator'] = 'SLAB'
+    else:
+        err('Can\'t determine the slab allocator')
+
+    return cfg
+
+
+def main():
+    parser = argparse.ArgumentParser(description=DESC,
+                                     formatter_class=
+                                     argparse.RawTextHelpFormatter)
+    parser.add_argument('cgroup', metavar='CGROUP',
+                        help='Target memory cgroup')
+    args = parser.parse_args()
+
+    try:
+        cgroup_id = stat(args.cgroup).st_ino
+        find_memcg_ids()
+        memcg = MEMCGS[cgroup_id]
+    except KeyError:
+        err('Can\'t find the memory cgroup')
+
+    cfg = detect_kernel_config()
+
+    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
+          ' : tunables <limit> <batchcount> <sharedfactor>'
+          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
+
+    for s in list_for_each_entry('struct kmem_cache',
+                                 memcg.kmem_caches.address_of_(),
+                                 'memcg_params.kmem_caches_node'):
+        cache_show(s, cfg)
+
+
+main()
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj()
       [not found] ` <20200127173453.2089565-8-guro@fb.com>
@ 2020-02-03 16:05   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:32AM -0800, Roman Gushchin wrote:
> @@ -757,13 +757,12 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  
>  void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
>  {
> -	struct page *page = virt_to_head_page(p);
> -	pg_data_t *pgdat = page_pgdat(page);
> +	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
>  	struct mem_cgroup *memcg;
>  	struct lruvec *lruvec;
>  
>  	rcu_read_lock();
> -	memcg = memcg_from_slab_page(page);
> +	memcg = mem_cgroup_from_obj(p);
>  
>  	/* Untracked pages have no memcg, no lruvec. Update only the node */
>  	if (!memcg || memcg == root_mem_cgroup) {

This function is specifically for slab objects. Why does it need the
indirection and additional branch here?

If memcg_from_slab_page() is going away later, I think the conversion
to this new helper should happen at that point in the series, not now.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations
       [not found] ` <20200127173453.2089565-9-guro@fb.com>
@ 2020-02-03 16:12   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:33AM -0800, Roman Gushchin wrote:
> Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio
> the space for task stacks can be allocated using __vmalloc_node_range(),
> alloc_pages_node() and kmem_cache_alloc_node(). In the first and the
> second cases page->mem_cgroup pointer is set, but in the third it's
> not: memcg membership of a slab page should be determined using the
> memcg_from_slab_page() function, which looks at
> page->slab_cache->memcg_params.memcg . In this case, using
> mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
> page->mem_cgroup pointer is NULL even for pages charged to a non-root
> memory cgroup.
> 
> In order to fix it, let's introduce a mod_memcg_obj_state() helper,
> which takes a pointer to a kernel object as a first argument, uses
> mem_cgroup_from_obj() to get a RCU-protected memcg pointer and
> calls mod_memcg_state(). It allows to handle all possible
> configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE
> values) without spilling any memcg/kmem specifics into fork.c .

The change looks good to me, but it sounds like this is a bug with
actual consequences to userspace. Can you elaborate on that in the
changelog please? Maybe add a Fixes: line, if applicable?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
       [not found] ` <20200127173453.2089565-10-guro@fb.com>
@ 2020-02-03 16:13   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 16:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:34AM -0800, Roman Gushchin wrote:
> Rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
> to unify it with mod_memcg_obj_state(). It better reflects the fact
> that the passed object isn't necessary slab-backed.

Makes sense to me.

> @@ -1116,7 +1116,7 @@ static inline void mod_lruvec_page_state(struct page *page,
>  	mod_node_page_state(page_pgdat(page), idx, val);
>  }
>  
> -static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
> +static inline void __mod_lruvec_obj_state(void *p, enum node_stat_item idx,
>  					   int val)
>  {
>  	struct page *page = virt_to_head_page(p);
> @@ -1217,12 +1217,12 @@ static inline void __dec_lruvec_page_state(struct page *page,
>  
>  static inline void __inc_lruvec_slab_state(void *p, enum node_stat_item idx)
>  {
> -	__mod_lruvec_slab_state(p, idx, 1);
> +	__mod_lruvec_obj_state(p, idx, 1);
>  }
>  
>  static inline void __dec_lruvec_slab_state(void *p, enum node_stat_item idx)
>  {
> -	__mod_lruvec_slab_state(p, idx, -1);
> +	__mod_lruvec_obj_state(p, idx, -1);
>  }

These should be renamed as well, no?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state()
       [not found] ` <20200127173453.2089565-11-guro@fb.com>
@ 2020-02-03 17:39   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:35AM -0800, Roman Gushchin wrote:
> To prepare for per-object accounting of slab objects, let's introduce
> __mod_lruvec_memcg_state() and mod_lruvec_memcg_state() helpers,
> which are similar to mod_lruvec_state(), but do not update global
> node counters, only lruvec and per-cgroup.
> 
> It's necessary because soon node slab counters will be used for
> accounting of all memory used by slab pages, however on memcg level
> only the actually used memory will be counted. The free space will be
> shared between all cgroups, so it can't be accounted to any
> specific cgroup.

Makes perfect sense. However, I think the existing mod_lruvec_state()
has a bad and misleading name, and adding to it in the same style
makes things worse.

Can we instead rename lruvec_state to node_memcg_state to capture that
it changes all levels. And then do the following, clean API?

- node_state for node only

- memcg_state for memcg only

- lruvec_state for lruvec only

- node_memcg_state convenience wrapper to change node, memcg, lruvec counters

You can then open-code the disjunct node and memcg+lruvec counters.

[ Granted, lruvec counters are never modified on their own - always in
  conjunction with the memcg counters. And frankly, the only memcg
  counters that are modified *without* the lruvec counter-part are the
  special-case MEMCG_ counters.

  It would be nice to have 1) a completely separate API for the MEMCG_
  counters; and then 2) the node API for node and 3) a cgroup API for
  memcg+lruvec VM stat counters that allow you to easily do the
  disjunct accounting for slab memory.

  But I can't think of poignant names for these. At least nothing that
  would be better than separate memcg_state and lruvec_state calls. ]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index()
       [not found] ` <20200127173453.2089565-12-guro@fb.com>
@ 2020-02-03 17:44   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:36AM -0800, Roman Gushchin wrote:
> This commit implements SLUB version of the obj_to_index() function,
> which will be required to calculate the offset of obj_cgroup in the
> obj_cgroups vector to store/obtain the objcg ownership data.
> 
> To make it faster, let's repeat the SLAB's trick introduced by
> commit 6a2d7a955d8d ("[PATCH] SLAB: use a multiply instead of a
> divide in obj_to_index()") and avoid an expensive division.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Christoph Lameter <cl@linux.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
       [not found] ` <20200127173453.2089565-13-guro@fb.com>
@ 2020-02-03 17:58   ` Johannes Weiner
  2020-02-03 18:25     ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 17:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
> 
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.
> 
> This doesn't affect per-zone statistics. There are no plans to use
> zone-level byte-sized counters, so no reasons to change anything.

Wait, is this still necessary? AFAIU, the node counters will account
full slab pages, including free space, and only the memcg counters
that track actual objects will be in bytes.

Can you please elaborate?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 17:58   ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Johannes Weiner
@ 2020-02-03 18:25     ` Roman Gushchin
  2020-02-03 20:34       ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 18:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > Currently s8 type is used for per-cpu caching of per-node statistics.
> > It works fine because the overfill threshold can't exceed 125.
> > 
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> > 
> > This doesn't affect per-zone statistics. There are no plans to use
> > zone-level byte-sized counters, so no reasons to change anything.
> 
> Wait, is this still necessary? AFAIU, the node counters will account
> full slab pages, including free space, and only the memcg counters
> that track actual objects will be in bytes.
> 
> Can you please elaborate?

It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
being in different units depending on the accounting scope.
So I do convert all slab counters: global, per-lruvec,
and per-memcg to bytes.

Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
NR_SLAB_RECLAIMABLE_OBJ
NR_SLAB_UNRECLAIMABLE_OBJ
and keep global counters untouched. If going this way, I'd prefer to make
them per-memcg, because it will simplify things on charging paths:
now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
bump per-lruvec counters.


Btw, I wonder if we really need per-lruvec counters at all (at least
being enabled by default). For the significant amount of users who
have a single-node machine it doesn't bring anything except performance
overhead. For those who have multiple nodes (and most likely many many
memory cgroups) it provides way too many data except for debugging
some weird mm issues.
I guess in the absolute majority of cases having global per-node + per-memcg
counters will be enough.

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
       [not found] ` <20200127173453.2089565-17-guro@fb.com>
@ 2020-02-03 18:27   ` Johannes Weiner
  2020-02-03 18:34     ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 18:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
> 
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/mm.h       | 25 ++++++++++++++++++--
>  include/linux/mm_types.h |  5 +++-
>  mm/memcontrol.c          |  5 ++--
>  mm/slab.c                |  3 ++-
>  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
>  mm/slub.c                |  2 +-
>  6 files changed, 83 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 080f8ac8bfb7..65224becc4ca 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
>  #ifdef CONFIG_MEMCG
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
> -	return page->mem_cgroup;
> +	struct mem_cgroup *memcg = page->mem_cgroup;
> +
> +	/*
> +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> +	 * but a obj_cgroups pointer. In this case the page is shared and
> +	 * isn't charged to any specific memory cgroup. Return NULL.
> +	 */
> +	if ((unsigned long) memcg & 0x1UL)
> +		memcg = NULL;
> +
> +	return memcg;

That should really WARN instead of silently returning NULL. Which
callsite optimistically asks a page's cgroup when it has no idea
whether that page is actually a userpage or not?

>  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>  {
> +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> +
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> -	return READ_ONCE(page->mem_cgroup);
> +
> +	/*
> +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> +	 * but a obj_cgroups pointer. In this case the page is shared and
> +	 * isn't charged to any specific memory cgroup. Return NULL.
> +	 */
> +	if ((unsigned long) memcg & 0x1UL)
> +		memcg = NULL;
> +
> +	return memcg;

Same here.

>  }
>  #else
>  static inline struct mem_cgroup *page_memcg(struct page *page)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 270aa8fd2800..5102f00f3336 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -198,7 +198,10 @@ struct page {
>  	atomic_t _refcount;
>  
>  #ifdef CONFIG_MEMCG
> -	struct mem_cgroup *mem_cgroup;
> +	union {
> +		struct mem_cgroup *mem_cgroup;
> +		struct obj_cgroup **obj_cgroups;
> +	};

Since you need the casts in both cases anyway, it's safer (and
simpler) to do

	unsigned long mem_cgroup;

to prevent accidental direct derefs in future code.

Otherwise, this patch looks good to me!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 18:27   ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Johannes Weiner
@ 2020-02-03 18:34     ` Roman Gushchin
  2020-02-03 20:46       ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 18:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > Allocate and release memory to store obj_cgroup pointers for each
> > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > to the allocated space.
> > 
> > To distinguish between obj_cgroups and memcg pointers in case
> > when it's not obvious which one is used (as in page_cgroup_ino()),
> > let's always set the lowest bit in the obj_cgroup case.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  include/linux/mm.h       | 25 ++++++++++++++++++--
> >  include/linux/mm_types.h |  5 +++-
> >  mm/memcontrol.c          |  5 ++--
> >  mm/slab.c                |  3 ++-
> >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> >  mm/slub.c                |  2 +-
> >  6 files changed, 83 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 080f8ac8bfb7..65224becc4ca 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> >  #ifdef CONFIG_MEMCG
> >  static inline struct mem_cgroup *page_memcg(struct page *page)
> >  {
> > -	return page->mem_cgroup;
> > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > +	 */
> > +	if ((unsigned long) memcg & 0x1UL)
> > +		memcg = NULL;
> > +
> > +	return memcg;
> 
> That should really WARN instead of silently returning NULL. Which
> callsite optimistically asks a page's cgroup when it has no idea
> whether that page is actually a userpage or not?

For instance, look at page_cgroup_ino() called from the
reading /proc/kpageflags.

> 
> >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> >  {
> > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > +
> >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return READ_ONCE(page->mem_cgroup);
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > +	 */
> > +	if ((unsigned long) memcg & 0x1UL)
> > +		memcg = NULL;
> > +
> > +	return memcg;
> 
> Same here.
> 
> >  }
> >  #else
> >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 270aa8fd2800..5102f00f3336 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -198,7 +198,10 @@ struct page {
> >  	atomic_t _refcount;
> >  
> >  #ifdef CONFIG_MEMCG
> > -	struct mem_cgroup *mem_cgroup;
> > +	union {
> > +		struct mem_cgroup *mem_cgroup;
> > +		struct obj_cgroup **obj_cgroups;
> > +	};
> 
> Since you need the casts in both cases anyway, it's safer (and
> simpler) to do
> 
> 	unsigned long mem_cgroup;
> 
> to prevent accidental direct derefs in future code.

Agree. Maybe even mem_cgroup_data?

> 
> Otherwise, this patch looks good to me!

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API
       [not found] ` <20200127173453.2089565-16-guro@fb.com>
@ 2020-02-03 19:31   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:40AM -0800, Roman Gushchin wrote:
> Obj_cgroup API provides an ability to account sub-page sized kernel
> objects, which potentially outlive the original memory cgroup.
> 
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
> 
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> 
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> 
> Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> reference counter. It substitutes a memory cgroup in places where
> it's necessary to charge a custom amount of bytes instead of pages.
> 
> All charged memory rounded down to pages is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
> 
> It implements reparenting: on memcg offlining it's getting reattached
> to the parent memory cgroup. Each online memory cgroup has an
> associated active object cgroup to handle new allocations and the list
> of all attached object cgroups. On offlining of a cgroup this list is
> reparented and for each object cgroup in the list the memcg pointer is
> swapped to the parent memory cgroup. It prevents long-living objects
> from pinning the original memory cgroup in the memory.
> 
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is a part of
> obj_cgroup object. So on cgroup offlining the leftover is automatically
> reparented.
> 
> memcg->objcg is rcu protected.
> objcg->memcg is a raw pointer, which is always pointing at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
>  	struct wb_completion done;	/* tracks in-flight foreign writebacks */
>  };
>  
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +	struct percpu_ref refcnt;
> +	struct mem_cgroup *memcg;
> +	atomic_t nr_charged_bytes;
> +	union {
> +		struct list_head list;
> +		struct rcu_head rcu;
> +	};
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -306,6 +323,8 @@ struct mem_cgroup {
>  	int kmemcg_id;
>  	enum memcg_kmem_state kmem_state;
>  	struct list_head kmem_caches;
> +	struct obj_cgroup __rcu *objcg;
> +	struct list_head objcg_list;

These could use a comment, IMO.

	/*
	 * Active object acounting bucket, as well as
	 * reparented buckets from dead children with
	 * outstanding objects.
	 */

or something like that.

> @@ -257,6 +257,73 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  }
>  
>  #ifdef CONFIG_MEMCG_KMEM
> +extern spinlock_t css_set_lock;
> +
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	unsigned int nr_bytes;
> +	unsigned int nr_pages;
> +	unsigned long flags;
> +
> +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> +	nr_pages = nr_bytes >> PAGE_SHIFT;
> +
> +	if (nr_pages) {
> +		rcu_read_lock();
> +		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +		rcu_read_unlock();
> +	}
> +
> +	spin_lock_irqsave(&css_set_lock, flags);
> +	list_del(&objcg->list);
> +	mem_cgroup_put(obj_cgroup_memcg(objcg));
> +	spin_unlock_irqrestore(&css_set_lock, flags);

Heh, two obj_cgroup_memcg() lookups with different synchronization
rules.

I know that reparenting could happen in between the page uncharge and
the mem_cgroup_put(), and it would still be safe because the counters
are migrated atomically. But it seems needlessly lockless and complex.

Since you have to css_set_lock anyway, wouldn't it be better to do

	spin_lock_irqsave(&css_set_lock, flags);
	memcg = obj_cgroup_memcg(objcg);
	if (nr_pages)
		__memcg_kmem_uncharge(memcg, nr_pages);
	list_del(&objcg->list);
	mem_cgroup_put(memcg);
	spin_unlock_irqrestore(&css_set_lock, flags);

instead?

> +	percpu_ref_exit(ref);
> +	kfree_rcu(objcg, rcu);
> +}
> +
> +static struct obj_cgroup *obj_cgroup_alloc(void)
> +{
> +	struct obj_cgroup *objcg;
> +	int ret;
> +
> +	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
> +	if (!objcg)
> +		return NULL;
> +
> +	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
> +			      GFP_KERNEL);
> +	if (ret) {
> +		kfree(objcg);
> +		return NULL;
> +	}
> +	INIT_LIST_HEAD(&objcg->list);
> +	return objcg;
> +}
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> +				  struct mem_cgroup *parent)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

Can this actually race with new charges? By the time we are going
offline, where would they be coming from?

What happens if the charger sees a live memcg, but its memcg->objcg is
cleared? Shouldn't they have the same kind of lifetime, where as long
as the memcg can be charged, so can the objcg? What would happen if
you didn't clear memcg->objcg here?

> +	/* Paired with mem_cgroup_put() in objcg_release(). */
> +	css_get(&memcg->css);
> +	percpu_ref_kill(&objcg->refcnt);
> +
> +	spin_lock_irq(&css_set_lock);
> +	list_for_each_entry(objcg, &memcg->objcg_list, list) {
> +		css_get(&parent->css);
> +		xchg(&objcg->memcg, parent);
> +		css_put(&memcg->css);
> +	}

I'm having a pretty hard time following this refcounting.

Why does objcg only acquire a css reference on the way out? It should
hold one when objcg->memcg is set up, and put it when that pointer
goes away.

But also, objcg is already on its own memcg->objcg_list from the
start, so on the first reparenting we get a css ref, then move it to
the parent, then obj_cgroup_release() puts one it doesn't have ...?

Argh, help.

> @@ -2978,6 +3070,120 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	if (PageKmemcg(page))
>  		__ClearPageKmemcg(page);
>  }
> +
> +static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +	bool ret = false;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> +		stock->nr_bytes -= nr_bytes;
> +		ret = true;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return ret;
> +}
> +
> +static void drain_obj_stock(struct memcg_stock_pcp *stock)
> +{
> +	struct obj_cgroup *old = stock->cached_objcg;
> +
> +	if (!old)
> +		return;
> +
> +	if (stock->nr_bytes) {
> +		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> +		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
> +
> +		if (nr_pages) {
> +			rcu_read_lock();
> +			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> +			rcu_read_unlock();
> +		}
> +
> +		atomic_add(nr_bytes, &old->nr_charged_bytes);
> +		stock->nr_bytes = 0;
> +	}
> +
> +	obj_cgroup_put(old);
> +	stock->cached_objcg = NULL;
> +}
> +
> +static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> +				     struct mem_cgroup *root_memcg)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (stock->cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->cached_objcg);
> +		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (stock->cached_objcg != objcg) { /* reset if necessary */
> +		drain_obj_stock(stock);
> +		obj_cgroup_get(objcg);
> +		stock->cached_objcg = objcg;
> +		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> +	}
> +	stock->nr_bytes += nr_bytes;
> +
> +	if (stock->nr_bytes > PAGE_SIZE)
> +		drain_obj_stock(stock);
> +
> +	local_irq_restore(flags);
> +}
> +
> +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned int nr_pages, nr_bytes;
> +	int ret;
> +
> +	if (consume_obj_stock(objcg, size))
> +		return 0;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	css_get(&memcg->css);
> +	rcu_read_unlock();

I don't quite understand the lifetime rules here. You're holding the
rcu lock, so the memcg object cannot get physically freed while you
are looking it up. But you could be racing with an offlining and see
the stale memcg pointer. Isn't css_get() unsafe? Doesn't this need a
retry loop around css_tryget() similar to get_mem_cgroup_from_mm()?

> +
> +	nr_pages = size >> PAGE_SHIFT;
> +	nr_bytes = size & (PAGE_SIZE - 1);
> +
> +	if (nr_bytes)
> +		nr_pages += 1;
> +
> +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +	if (!ret && nr_bytes)
> +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> +
> +	css_put(&memcg->css);
> +	return ret;
> +}
> +
> +void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> +{
> +	refill_obj_stock(objcg, size);
> +}
> +
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -3400,7 +3606,8 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	int memcg_id;
> +	struct obj_cgroup *objcg;
> +	int memcg_id, ret;
>  
>  	if (cgroup_memory_nokmem)
>  		return 0;
> @@ -3412,6 +3619,15 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
>  	if (memcg_id < 0)
>  		return memcg_id;
>  
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg) {
> +		memcg_free_cache_id(memcg_id);
> +		return ret;
> +	}
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +	list_add(&objcg->list, &memcg->objcg_list);

This self-hosting significantly adds to my confusion. It'd be a lot
easier to understand ownership rules and references if this list_add()
was done directly to the parent's list at the time of reparenting, not
here.

If the objcg holds a css reference, right here is where it should be
acquired. Then transferred in reparent and put during release.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
       [not found] ` <20200127173453.2089565-22-guro@fb.com>
@ 2020-02-03 19:50   ` Johannes Weiner
  2020-02-03 20:58     ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> This is fairly big but mostly red patch, which makes all non-root
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
> 
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.

This is definitely going in the right direction. But it doesn't quite
explain why we still need two sets of kmem_caches?

In the old scheme, we had completely separate per-cgroup caches with
separate slab pages. If a cgrouped process wanted to allocate a slab
object, we'd go to the root cache and used the cgroup id to look up
the right cgroup cache. On slab free we'd use page->slab_cache.

Now we have slab pages that have a page->objcg array. Why can't all
allocations go through a single set of kmem caches? If an allocation
is coming from a cgroup and the slab page the allocator wants to use
doesn't have an objcg array yet, we can allocate it on the fly, no?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects
       [not found] ` <20200127173453.2089565-18-guro@fb.com>
@ 2020-02-03 19:53   ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 19:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Jan 27, 2020 at 09:34:42AM -0800, Roman Gushchin wrote:
> Store the obj_cgroup pointer in the corresponding place of
> page->obj_cgroups for each allocated non-root slab object.
> Make sure that each allocated object holds a reference to obj_cgroup.
> 
> Objcg pointer is obtained from the memcg->objcg dereferencing
> in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> post_alloc_hook. Then in case of successful allocation(s) it's
> getting stored in the page->obj_cgroups vector.
> 
> The objcg obtaining part look a bit bulky now, but it will be simplified
> by next commits in the series.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/memcontrol.h |  3 +-
>  mm/memcontrol.c            | 14 +++++++--
>  mm/slab.c                  | 18 +++++++-----
>  mm/slab.h                  | 60 ++++++++++++++++++++++++++++++++++----
>  mm/slub.c                  | 14 +++++----
>  5 files changed, 88 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 30bbea3f85e2..54bfb26b5016 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1431,7 +1431,8 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  }
>  #endif
>  
> -struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
> +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
> +					struct obj_cgroup **objcgp);
>  void memcg_kmem_put_cache(struct kmem_cache *cachep);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94337ab1ebe9..0e9fe272e688 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2896,7 +2896,8 @@ static inline bool memcg_kmem_bypass(void)
>   * done with it, memcg_kmem_put_cache() must be called to release the
>   * reference.
>   */
> -struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> +struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
> +					struct obj_cgroup **objcgp)
>  {
>  	struct mem_cgroup *memcg;
>  	struct kmem_cache *memcg_cachep;
> @@ -2952,8 +2953,17 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>  	 */
>  	if (unlikely(!memcg_cachep))
>  		memcg_schedule_kmem_cache_create(memcg, cachep);
> -	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
> +	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
> +		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
> +
> +		if (!objcg || !obj_cgroup_tryget(objcg)) {
> +			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
> +			goto out_unlock;
> +		}

As per the reply to the previous patch: I don't understand why the
objcg requires a pulse check here. As long as the memcg is alive and
can be charged with memory, how can the objcg disappear?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 18:25     ` Roman Gushchin
@ 2020-02-03 20:34       ` Johannes Weiner
  2020-02-03 22:28         ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 20:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > It works fine because the overfill threshold can't exceed 125.
> > > 
> > > But if some counters are in bytes (and the next commit in the series
> > > will convert slab counters to bytes), it's not gonna work:
> > > value in bytes can easily exceed s8 without exceeding the threshold
> > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > vmstats correctness, let's use s32 instead.
> > > 
> > > This doesn't affect per-zone statistics. There are no plans to use
> > > zone-level byte-sized counters, so no reasons to change anything.
> > 
> > Wait, is this still necessary? AFAIU, the node counters will account
> > full slab pages, including free space, and only the memcg counters
> > that track actual objects will be in bytes.
> > 
> > Can you please elaborate?
> 
> It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> being in different units depending on the accounting scope.
> So I do convert all slab counters: global, per-lruvec,
> and per-memcg to bytes.

Since the node counters tracks allocated slab pages and the memcg
counter tracks allocated objects, arguably they shouldn't use the same
name anyway.

> Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> NR_SLAB_RECLAIMABLE_OBJ
> NR_SLAB_UNRECLAIMABLE_OBJ

Can we alias them and reuse their slots?

	/* Reuse the node slab page counters item for charged objects */
	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,

> and keep global counters untouched. If going this way, I'd prefer to make
> them per-memcg, because it will simplify things on charging paths:
> now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> bump per-lruvec counters.

I don't quite follow. Don't you still have to update the global
counters?

> Btw, I wonder if we really need per-lruvec counters at all (at least
> being enabled by default). For the significant amount of users who
> have a single-node machine it doesn't bring anything except performance
> overhead.

Yeah, for single-node systems we should be able to redirect everything
to the memcg counters, without allocating and tracking lruvec copies.

> For those who have multiple nodes (and most likely many many
> memory cgroups) it provides way too many data except for debugging
> some weird mm issues.
> I guess in the absolute majority of cases having global per-node + per-memcg
> counters will be enough.

Hm? Reclaim uses the lruvec counters.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 18:34     ` Roman Gushchin
@ 2020-02-03 20:46       ` Johannes Weiner
  2020-02-03 21:19         ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 20:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > Allocate and release memory to store obj_cgroup pointers for each
> > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > to the allocated space.
> > > 
> > > To distinguish between obj_cgroups and memcg pointers in case
> > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > let's always set the lowest bit in the obj_cgroup case.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > >  include/linux/mm_types.h |  5 +++-
> > >  mm/memcontrol.c          |  5 ++--
> > >  mm/slab.c                |  3 ++-
> > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > >  mm/slub.c                |  2 +-
> > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 080f8ac8bfb7..65224becc4ca 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > >  #ifdef CONFIG_MEMCG
> > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > >  {
> > > -	return page->mem_cgroup;
> > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > +
> > > +	/*
> > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > +	 */
> > > +	if ((unsigned long) memcg & 0x1UL)
> > > +		memcg = NULL;
> > > +
> > > +	return memcg;
> > 
> > That should really WARN instead of silently returning NULL. Which
> > callsite optimistically asks a page's cgroup when it has no idea
> > whether that page is actually a userpage or not?
> 
> For instance, look at page_cgroup_ino() called from the
> reading /proc/kpageflags.

But that checks PageSlab() and implements memcg_from_slab_page() to
handle that case properly. And that's what we expect all callsites to
do: make sure that the question asked actually makes sense, instead of
having the interface paper over bogus requests.

If that function is completely racy and PageSlab isn't stable, then it
should really just open-code the lookup, rather than require weakening
the interface for everybody else.

> > >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > >  {
> > > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > > +
> > >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > > -	return READ_ONCE(page->mem_cgroup);
> > > +
> > > +	/*
> > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > +	 */
> > > +	if ((unsigned long) memcg & 0x1UL)
> > > +		memcg = NULL;
> > > +
> > > +	return memcg;
> > 
> > Same here.
> > 
> > >  }
> > >  #else
> > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 270aa8fd2800..5102f00f3336 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -198,7 +198,10 @@ struct page {
> > >  	atomic_t _refcount;
> > >  
> > >  #ifdef CONFIG_MEMCG
> > > -	struct mem_cgroup *mem_cgroup;
> > > +	union {
> > > +		struct mem_cgroup *mem_cgroup;
> > > +		struct obj_cgroup **obj_cgroups;
> > > +	};
> > 
> > Since you need the casts in both cases anyway, it's safer (and
> > simpler) to do
> > 
> > 	unsigned long mem_cgroup;
> > 
> > to prevent accidental direct derefs in future code.
> 
> Agree. Maybe even mem_cgroup_data?

Personally, I don't think the suffix adds much. The type makes it so
the compiler catches any accidental use, and access is very
centralized so greppability doesn't matter much.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 19:50   ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Johannes Weiner
@ 2020-02-03 20:58     ` Roman Gushchin
  2020-02-03 22:17       ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 20:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > This is fairly big but mostly red patch, which makes all non-root
> > slab allocations use a single set of kmem_caches instead of
> > creating a separate set for each memory cgroup.
> > 
> > Because the number of non-root kmem_caches is now capped by the number
> > of root kmem_caches, there is no need to shrink or destroy them
> > prematurely. They can be perfectly destroyed together with their
> > root counterparts. This allows to dramatically simplify the
> > management of non-root kmem_caches and delete a ton of code.
> 
> This is definitely going in the right direction. But it doesn't quite
> explain why we still need two sets of kmem_caches?
> 
> In the old scheme, we had completely separate per-cgroup caches with
> separate slab pages. If a cgrouped process wanted to allocate a slab
> object, we'd go to the root cache and used the cgroup id to look up
> the right cgroup cache. On slab free we'd use page->slab_cache.
> 
> Now we have slab pages that have a page->objcg array. Why can't all
> allocations go through a single set of kmem caches? If an allocation
> is coming from a cgroup and the slab page the allocator wants to use
> doesn't have an objcg array yet, we can allocate it on the fly, no?

Well, arguably it can be done, but there are few drawbacks:

1) On the release path you'll need to make some extra work even for
   root allocations: calculate the offset only to find the NULL objcg pointer.

2) There will be a memory overhead for root allocations
   (which might or might not be compensated by the increase
   of the slab utilization).

3) I'm working on percpu memory accounting that resembles the same scheme,
   except that obj_cgroups vector is created for the whole percpu block.
   There will be root- and memcg-blocks, and it will be expensive to merge them.
   I kinda like using the same scheme here and there.

Upsides?

1) slab utilization might increase a little bit (but I doubt it will have
   a huge effect, because both merging sets should be relatively big and well
   utilized)
2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
   but there isn't so much code left anyway.


So IMO it's an interesting direction to explore, but not something
that necessarily has to be done in the context of this patchset.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 20:46       ` Johannes Weiner
@ 2020-02-03 21:19         ` Roman Gushchin
  2020-02-03 22:29           ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 21:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 03:46:27PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > > Allocate and release memory to store obj_cgroup pointers for each
> > > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > > to the allocated space.
> > > > 
> > > > To distinguish between obj_cgroups and memcg pointers in case
> > > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > > let's always set the lowest bit in the obj_cgroup case.
> > > > 
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > ---
> > > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > > >  include/linux/mm_types.h |  5 +++-
> > > >  mm/memcontrol.c          |  5 ++--
> > > >  mm/slab.c                |  3 ++-
> > > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > > >  mm/slub.c                |  2 +-
> > > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 080f8ac8bfb7..65224becc4ca 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > > >  #ifdef CONFIG_MEMCG
> > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > >  {
> > > > -	return page->mem_cgroup;
> > > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > > +
> > > > +	/*
> > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > +	 */
> > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > +		memcg = NULL;
> > > > +
> > > > +	return memcg;
> > > 
> > > That should really WARN instead of silently returning NULL. Which
> > > callsite optimistically asks a page's cgroup when it has no idea
> > > whether that page is actually a userpage or not?
> > 
> > For instance, look at page_cgroup_ino() called from the
> > reading /proc/kpageflags.
> 
> But that checks PageSlab() and implements memcg_from_slab_page() to
> handle that case properly. And that's what we expect all callsites to
> do: make sure that the question asked actually makes sense, instead of
> having the interface paper over bogus requests.
> 
> If that function is completely racy and PageSlab isn't stable, then it
> should really just open-code the lookup, rather than require weakening
> the interface for everybody else.

Why though?

Another example: process stack can be depending on the machine config and
platform a vmalloc allocation, a slab allocation or a "high-order slab allocation",
which is executed by the page allocator directly.

It's kinda nice to have a function that hides accounting details
and returns a valid memcg pointer for any kind of objects.

To me it seems to be a valid question:
for a given kernel object give me a pointer to the memory cgroup.

Why it's weakening?

Moreover, open-coding of this question leads to bugs like one fixed by
ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes").

> 
> > > >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > > >  {
> > > > +	struct mem_cgroup *memcg = READ_ONCE(page->mem_cgroup);
> > > > +
> > > >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > > > -	return READ_ONCE(page->mem_cgroup);
> > > > +
> > > > +	/*
> > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > +	 */
> > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > +		memcg = NULL;
> > > > +
> > > > +	return memcg;
> > > 
> > > Same here.
> > > 
> > > >  }
> > > >  #else
> > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 270aa8fd2800..5102f00f3336 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -198,7 +198,10 @@ struct page {
> > > >  	atomic_t _refcount;
> > > >  
> > > >  #ifdef CONFIG_MEMCG
> > > > -	struct mem_cgroup *mem_cgroup;
> > > > +	union {
> > > > +		struct mem_cgroup *mem_cgroup;
> > > > +		struct obj_cgroup **obj_cgroups;
> > > > +	};
> > > 
> > > Since you need the casts in both cases anyway, it's safer (and
> > > simpler) to do
> > > 
> > > 	unsigned long mem_cgroup;
> > > 
> > > to prevent accidental direct derefs in future code.
> > 
> > Agree. Maybe even mem_cgroup_data?
> 
> Personally, I don't think the suffix adds much. The type makes it so
> the compiler catches any accidental use, and access is very
> centralized so greppability doesn't matter much.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 20:58     ` Roman Gushchin
@ 2020-02-03 22:17       ` Johannes Weiner
  2020-02-03 22:38         ` Roman Gushchin
  2020-02-04  1:15         ` Roman Gushchin
  0 siblings, 2 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > This is fairly big but mostly red patch, which makes all non-root
> > > slab allocations use a single set of kmem_caches instead of
> > > creating a separate set for each memory cgroup.
> > > 
> > > Because the number of non-root kmem_caches is now capped by the number
> > > of root kmem_caches, there is no need to shrink or destroy them
> > > prematurely. They can be perfectly destroyed together with their
> > > root counterparts. This allows to dramatically simplify the
> > > management of non-root kmem_caches and delete a ton of code.
> > 
> > This is definitely going in the right direction. But it doesn't quite
> > explain why we still need two sets of kmem_caches?
> > 
> > In the old scheme, we had completely separate per-cgroup caches with
> > separate slab pages. If a cgrouped process wanted to allocate a slab
> > object, we'd go to the root cache and used the cgroup id to look up
> > the right cgroup cache. On slab free we'd use page->slab_cache.
> > 
> > Now we have slab pages that have a page->objcg array. Why can't all
> > allocations go through a single set of kmem caches? If an allocation
> > is coming from a cgroup and the slab page the allocator wants to use
> > doesn't have an objcg array yet, we can allocate it on the fly, no?
> 
> Well, arguably it can be done, but there are few drawbacks:
> 
> 1) On the release path you'll need to make some extra work even for
>    root allocations: calculate the offset only to find the NULL objcg pointer.
> 
> 2) There will be a memory overhead for root allocations
>    (which might or might not be compensated by the increase
>    of the slab utilization).

Those two are only true if there is a wild mix of root and cgroup
allocations inside the same slab, and that doesn't really happen in
practice. Either the machine is dedicated to one workload and cgroups
are only enabled due to e.g. a vendor kernel, or you have cgrouped
systems (like most distro systems now) that cgroup everything.

> 3) I'm working on percpu memory accounting that resembles the same scheme,
>    except that obj_cgroups vector is created for the whole percpu block.
>    There will be root- and memcg-blocks, and it will be expensive to merge them.
>    I kinda like using the same scheme here and there.

It's hard to conclude anything based on this information alone. If
it's truly expensive to merge them, then it warrants the additional
complexity. But I don't understand the desire to share a design for
two systems with sufficiently different constraints.

> Upsides?
> 
> 1) slab utilization might increase a little bit (but I doubt it will have
>    a huge effect, because both merging sets should be relatively big and well
>    utilized)

Right.

> 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
>    but there isn't so much code left anyway.

There is a lot of complexity associated with the cache cloning that
isn't the lines of code, but the lifetime and synchronization rules.

And these two things are the primary aspects that make my head hurt
trying to review this patch series.

> So IMO it's an interesting direction to explore, but not something
> that necessarily has to be done in the context of this patchset.

I disagree. Instead of replacing the old coherent model and its
complexities with a new coherent one, you are mixing the two. And I
can barely understand the end result.

Dynamically cloning entire slab caches for the sole purpose of telling
whether the pages have an obj_cgroup array or not is *completely
insane*. If the controller had followed the obj_cgroup design from the
start, nobody would have ever thought about doing it like this.

From a maintainability POV, we cannot afford merging it in this form.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 20:34       ` Johannes Weiner
@ 2020-02-03 22:28         ` Roman Gushchin
  2020-02-03 22:39           ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 22:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > It works fine because the overfill threshold can't exceed 125.
> > > > 
> > > > But if some counters are in bytes (and the next commit in the series
> > > > will convert slab counters to bytes), it's not gonna work:
> > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > vmstats correctness, let's use s32 instead.
> > > > 
> > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > zone-level byte-sized counters, so no reasons to change anything.
> > > 
> > > Wait, is this still necessary? AFAIU, the node counters will account
> > > full slab pages, including free space, and only the memcg counters
> > > that track actual objects will be in bytes.
> > > 
> > > Can you please elaborate?
> > 
> > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > being in different units depending on the accounting scope.
> > So I do convert all slab counters: global, per-lruvec,
> > and per-memcg to bytes.
> 
> Since the node counters tracks allocated slab pages and the memcg
> counter tracks allocated objects, arguably they shouldn't use the same
> name anyway.
> 
> > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > NR_SLAB_RECLAIMABLE_OBJ
> > NR_SLAB_UNRECLAIMABLE_OBJ
> 
> Can we alias them and reuse their slots?
> 
> 	/* Reuse the node slab page counters item for charged objects */
> 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,

Yeah, lgtm.

Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
the enum memcg_stat_item?

> 
> > and keep global counters untouched. If going this way, I'd prefer to make
> > them per-memcg, because it will simplify things on charging paths:
> > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > bump per-lruvec counters.
> 
> I don't quite follow. Don't you still have to update the global
> counters?

Global counters are updated only if an allocation requires a new slab
page, which isn't the most common path.
In generic case post_hook is required because it's the only place where
we have both page (to get the node) and memcg pointer.

If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
then post_hook can handle only the rare "allocation failed" case.

I'm not sure here what's better.

> 
> > Btw, I wonder if we really need per-lruvec counters at all (at least
> > being enabled by default). For the significant amount of users who
> > have a single-node machine it doesn't bring anything except performance
> > overhead.
> 
> Yeah, for single-node systems we should be able to redirect everything
> to the memcg counters, without allocating and tracking lruvec copies.

Sounds good. It can lead to significant savings on single-node machines.

> 
> > For those who have multiple nodes (and most likely many many
> > memory cgroups) it provides way too many data except for debugging
> > some weird mm issues.
> > I guess in the absolute majority of cases having global per-node + per-memcg
> > counters will be enough.
> 
> Hm? Reclaim uses the lruvec counters.

Can you, please, provide some examples? It looks like it's mostly based
on per-zone lruvec size counters.

Anyway, it seems to be a little bit off from this patchset, so let's
discuss it separately.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  2020-02-03 21:19         ` Roman Gushchin
@ 2020-02-03 22:29           ` Johannes Weiner
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 01:19:15PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 03:46:27PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 10:34:52AM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 01:27:56PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:41AM -0800, Roman Gushchin wrote:
> > > > > Allocate and release memory to store obj_cgroup pointers for each
> > > > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > > > to the allocated space.
> > > > > 
> > > > > To distinguish between obj_cgroups and memcg pointers in case
> > > > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > > > let's always set the lowest bit in the obj_cgroup case.
> > > > > 
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > ---
> > > > >  include/linux/mm.h       | 25 ++++++++++++++++++--
> > > > >  include/linux/mm_types.h |  5 +++-
> > > > >  mm/memcontrol.c          |  5 ++--
> > > > >  mm/slab.c                |  3 ++-
> > > > >  mm/slab.h                | 51 +++++++++++++++++++++++++++++++++++++++-
> > > > >  mm/slub.c                |  2 +-
> > > > >  6 files changed, 83 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 080f8ac8bfb7..65224becc4ca 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -1264,12 +1264,33 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > > > >  #ifdef CONFIG_MEMCG
> > > > >  static inline struct mem_cgroup *page_memcg(struct page *page)
> > > > >  {
> > > > > -	return page->mem_cgroup;
> > > > > +	struct mem_cgroup *memcg = page->mem_cgroup;
> > > > > +
> > > > > +	/*
> > > > > +	 * The lowest bit set means that memcg isn't a valid memcg pointer,
> > > > > +	 * but a obj_cgroups pointer. In this case the page is shared and
> > > > > +	 * isn't charged to any specific memory cgroup. Return NULL.
> > > > > +	 */
> > > > > +	if ((unsigned long) memcg & 0x1UL)
> > > > > +		memcg = NULL;
> > > > > +
> > > > > +	return memcg;
> > > > 
> > > > That should really WARN instead of silently returning NULL. Which
> > > > callsite optimistically asks a page's cgroup when it has no idea
> > > > whether that page is actually a userpage or not?
> > > 
> > > For instance, look at page_cgroup_ino() called from the
> > > reading /proc/kpageflags.
> > 
> > But that checks PageSlab() and implements memcg_from_slab_page() to
> > handle that case properly. And that's what we expect all callsites to
> > do: make sure that the question asked actually makes sense, instead of
> > having the interface paper over bogus requests.
> > 
> > If that function is completely racy and PageSlab isn't stable, then it
> > should really just open-code the lookup, rather than require weakening
> > the interface for everybody else.
> 
> Why though?
> 
> Another example: process stack can be depending on the machine config and
> platform a vmalloc allocation, a slab allocation or a "high-order slab allocation",
> which is executed by the page allocator directly.
> 
> It's kinda nice to have a function that hides accounting details
> and returns a valid memcg pointer for any kind of objects.

Hm? I'm not objecting to that, memcg_from_obj() makes perfect sense to
me, to use with kvmalloc() objects for example.

I'm objecting to page_memcg() silently swallowing bogus inputs. That
function shouldn't silently say "there is no cgroup associated with
this page" when the true answer is "this page has MANY cgroups
associated with it, this question doesn't make any sense".

It's not exactly hard to imagine how this could cause bugs, is it?
Where a caller should implement a slab case (exactly like
page_cgroup_ino()) but is confused about the type of page it has,
whether it's charged or not etc.?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 22:17       ` Johannes Weiner
@ 2020-02-03 22:38         ` Roman Gushchin
  2020-02-04  1:15         ` Roman Gushchin
  1 sibling, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-02-03 22:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.
> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.
> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

Having two sets of kmem_caches has nothing to do with the refcounting
and obj_cgroup abstraction.
Please, take a look at the final code.

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 22:28         ` Roman Gushchin
@ 2020-02-03 22:39           ` Johannes Weiner
  2020-02-04  1:44             ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-03 22:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > > It works fine because the overfill threshold can't exceed 125.
> > > > > 
> > > > > But if some counters are in bytes (and the next commit in the series
> > > > > will convert slab counters to bytes), it's not gonna work:
> > > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > > vmstats correctness, let's use s32 instead.
> > > > > 
> > > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > > zone-level byte-sized counters, so no reasons to change anything.
> > > > 
> > > > Wait, is this still necessary? AFAIU, the node counters will account
> > > > full slab pages, including free space, and only the memcg counters
> > > > that track actual objects will be in bytes.
> > > > 
> > > > Can you please elaborate?
> > > 
> > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > > being in different units depending on the accounting scope.
> > > So I do convert all slab counters: global, per-lruvec,
> > > and per-memcg to bytes.
> > 
> > Since the node counters tracks allocated slab pages and the memcg
> > counter tracks allocated objects, arguably they shouldn't use the same
> > name anyway.
> > 
> > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > > NR_SLAB_RECLAIMABLE_OBJ
> > > NR_SLAB_UNRECLAIMABLE_OBJ
> > 
> > Can we alias them and reuse their slots?
> > 
> > 	/* Reuse the node slab page counters item for charged objects */
> > 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> > 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
> 
> Yeah, lgtm.
> 
> Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
> the enum memcg_stat_item?

Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps?

> > > and keep global counters untouched. If going this way, I'd prefer to make
> > > them per-memcg, because it will simplify things on charging paths:
> > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > > bump per-lruvec counters.
> > 
> > I don't quite follow. Don't you still have to update the global
> > counters?
> 
> Global counters are updated only if an allocation requires a new slab
> page, which isn't the most common path.

Right.

> In generic case post_hook is required because it's the only place where
> we have both page (to get the node) and memcg pointer.
> 
> If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
> then post_hook can handle only the rare "allocation failed" case.
> 
> I'm not sure here what's better.

If it's tracked only per-memcg, you still have to account it every
time you charge an object to a memcg, no? How is it less frequent than
acconting at the lruvec level?

> > > Btw, I wonder if we really need per-lruvec counters at all (at least
> > > being enabled by default). For the significant amount of users who
> > > have a single-node machine it doesn't bring anything except performance
> > > overhead.
> > 
> > Yeah, for single-node systems we should be able to redirect everything
> > to the memcg counters, without allocating and tracking lruvec copies.
> 
> Sounds good. It can lead to significant savings on single-node machines.
> 
> > 
> > > For those who have multiple nodes (and most likely many many
> > > memory cgroups) it provides way too many data except for debugging
> > > some weird mm issues.
> > > I guess in the absolute majority of cases having global per-node + per-memcg
> > > counters will be enough.
> > 
> > Hm? Reclaim uses the lruvec counters.
> 
> Can you, please, provide some examples? It looks like it's mostly based
> on per-zone lruvec size counters.

It uses the recursive lruvec state to decide inactive_is_low(),
whether refaults are occuring, whether to trim cache only or go for
anon etc. We use it to determine refault distances and how many shadow
nodes to shrink.

Grep for lruvec_page_state().

> Anyway, it seems to be a little bit off from this patchset, so let's
> discuss it separately.

True

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-03 22:17       ` Johannes Weiner
  2020-02-03 22:38         ` Roman Gushchin
@ 2020-02-04  1:15         ` Roman Gushchin
  2020-02-04  2:47           ` Johannes Weiner
  1 sibling, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-04  1:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.

It's actually a questionable statement: we do skip allocations from certain
contexts, and we do merge slab caches.

Most likely it's true for certain slab_caches and not true for others.
Think of kmalloc-* caches.

Also, because obj_cgroup vectors will not be freed without underlying pages,
most likely the percentage of pages with obj_cgroups will grow with uptime.
In other words, memcg allocations will fragment root slab pages.

> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.

Quite opposite: the patchset removes all the complexity (or 90% of it),
because it makes the kmem_cache lifetime independent from any cgroup stuff.

Kmem_caches are created on demand on the first request (most likely during
the system start-up), and destroyed together with their root counterparts
(most likely never or on rmmod). First request means globally first request,
not a first request from a given memcg.

Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
after creation just matches the lifetime of the root kmem caches.

The only reason to keep the async creation is that some kmem_caches
are created very early in the boot process, long before any cgroup
stuff is initialized.

> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

It's just not true. The whole point of having root- and memcg sets is
to be able to not look for a NULL pointer in the obj_cgroup vector on
releasing of the root object. In other words, it allows to keep zero
overhead for root allocations. IMHO it's an important thing, and calling
it *completely insane* isn't the best way to communicate.

> 
> From a maintainability POV, we cannot afford merging it in this form.

It sounds strange: the patchset eliminates 90% of the complexity,
but it's unmergeable because there are 10% left.

I agree that it's an arguable question if we can tolerate some
additional overhead on root allocations to eliminate these additional
10%, but I really don't think it's so obvious that even discussing
it is insane.

Btw, there is another good idea to explore (also suggested by Christopher
Lameter): we can put memcg/objcg pointer into the slab page, avoiding
an extra allocation.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  2020-02-03 22:39           ` Johannes Weiner
@ 2020-02-04  1:44             ` Roman Gushchin
  0 siblings, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-02-04  1:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:39:54PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > > > It works fine because the overfill threshold can't exceed 125.
> > > > > > 
> > > > > > But if some counters are in bytes (and the next commit in the series
> > > > > > will convert slab counters to bytes), it's not gonna work:
> > > > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > > > vmstats correctness, let's use s32 instead.
> > > > > > 
> > > > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > > > zone-level byte-sized counters, so no reasons to change anything.
> > > > > 
> > > > > Wait, is this still necessary? AFAIU, the node counters will account
> > > > > full slab pages, including free space, and only the memcg counters
> > > > > that track actual objects will be in bytes.
> > > > > 
> > > > > Can you please elaborate?
> > > > 
> > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > > > being in different units depending on the accounting scope.
> > > > So I do convert all slab counters: global, per-lruvec,
> > > > and per-memcg to bytes.
> > > 
> > > Since the node counters tracks allocated slab pages and the memcg
> > > counter tracks allocated objects, arguably they shouldn't use the same
> > > name anyway.
> > > 
> > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > > > NR_SLAB_RECLAIMABLE_OBJ
> > > > NR_SLAB_UNRECLAIMABLE_OBJ
> > > 
> > > Can we alias them and reuse their slots?
> > > 
> > > 	/* Reuse the node slab page counters item for charged objects */
> > > 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> > > 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
> > 
> > Yeah, lgtm.
> > 
> > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
> > the enum memcg_stat_item?
> 
> Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps?

Maybe not. I'll probably go with 
    MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE,
    MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE,

Please, let me know if you're not ok with it.

> 
> > > > and keep global counters untouched. If going this way, I'd prefer to make
> > > > them per-memcg, because it will simplify things on charging paths:
> > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > > > bump per-lruvec counters.
> > > 
> > > I don't quite follow. Don't you still have to update the global
> > > counters?
> > 
> > Global counters are updated only if an allocation requires a new slab
> > page, which isn't the most common path.
> 
> Right.
> 
> > In generic case post_hook is required because it's the only place where
> > we have both page (to get the node) and memcg pointer.
> > 
> > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
> > then post_hook can handle only the rare "allocation failed" case.
> > 
> > I'm not sure here what's better.
> 
> If it's tracked only per-memcg, you still have to account it every
> time you charge an object to a memcg, no? How is it less frequent than
> acconting at the lruvec level?

It's not less frequent, it just can be done in the pre-alloc hook
when there is a memcg pointer available.

The problem with the obj_cgroup thing is that we get it indirectly
from current memcg in the pre_alloc_hook, then pass it to obj_cgroup API,
internally we might need to get the memcg from it to charge a page,
and then again in the post_hook we need to get memcg to bump
per-lruvec stats. In other words we make several memcg <-> objcg
conversions, which isn't very nice on the hot path.

I see that in the future we might optimize the initial lookup
of objcg, but getting memcg just to bump vmstats looks unnecessarily expensive.
One option I think about is to handle byte-sized stats on obj_cgroup
level and flush whole pages to memcg level.

> 
> > > > Btw, I wonder if we really need per-lruvec counters at all (at least
> > > > being enabled by default). For the significant amount of users who
> > > > have a single-node machine it doesn't bring anything except performance
> > > > overhead.
> > > 
> > > Yeah, for single-node systems we should be able to redirect everything
> > > to the memcg counters, without allocating and tracking lruvec copies.
> > 
> > Sounds good. It can lead to significant savings on single-node machines.
> > 
> > > 
> > > > For those who have multiple nodes (and most likely many many
> > > > memory cgroups) it provides way too many data except for debugging
> > > > some weird mm issues.
> > > > I guess in the absolute majority of cases having global per-node + per-memcg
> > > > counters will be enough.
> > > 
> > > Hm? Reclaim uses the lruvec counters.
> > 
> > Can you, please, provide some examples? It looks like it's mostly based
> > on per-zone lruvec size counters.
> 
> It uses the recursive lruvec state to decide inactive_is_low(),
> whether refaults are occuring, whether to trim cache only or go for
> anon etc. We use it to determine refault distances and how many shadow
> nodes to shrink.
> 
> Grep for lruvec_page_state().

I see... Thanks!

> 
> > Anyway, it seems to be a little bit off from this patchset, so let's
> > discuss it separately.
> 
> True

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  1:15         ` Roman Gushchin
@ 2020-02-04  2:47           ` Johannes Weiner
  2020-02-04  4:35             ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-04  2:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > slab allocations use a single set of kmem_caches instead of
> > > > > creating a separate set for each memory cgroup.
> > > > > 
> > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > prematurely. They can be perfectly destroyed together with their
> > > > > root counterparts. This allows to dramatically simplify the
> > > > > management of non-root kmem_caches and delete a ton of code.
> > > > 
> > > > This is definitely going in the right direction. But it doesn't quite
> > > > explain why we still need two sets of kmem_caches?
> > > > 
> > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > 
> > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > allocations go through a single set of kmem caches? If an allocation
> > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > 
> > > Well, arguably it can be done, but there are few drawbacks:
> > > 
> > > 1) On the release path you'll need to make some extra work even for
> > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > 
> > > 2) There will be a memory overhead for root allocations
> > >    (which might or might not be compensated by the increase
> > >    of the slab utilization).
> > 
> > Those two are only true if there is a wild mix of root and cgroup
> > allocations inside the same slab, and that doesn't really happen in
> > practice. Either the machine is dedicated to one workload and cgroups
> > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > systems (like most distro systems now) that cgroup everything.
> 
> It's actually a questionable statement: we do skip allocations from certain
> contexts, and we do merge slab caches.
> 
> Most likely it's true for certain slab_caches and not true for others.
> Think of kmalloc-* caches.

With merging it's actually really hard to say how sparse or dense the
resulting objcgroup arrays would be. It could change all the time too.

> Also, because obj_cgroup vectors will not be freed without underlying pages,
> most likely the percentage of pages with obj_cgroups will grow with uptime.
> In other words, memcg allocations will fragment root slab pages.

I understand the first part of this paragraph, but not the second. The
objcgroup vectors will be freed when the slab pages get freed. But the
partially filled slab pages can be reused by any types of allocations,
surely? How would this cause the pages to fragment?

> > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > >    except that obj_cgroups vector is created for the whole percpu block.
> > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > >    I kinda like using the same scheme here and there.
> > 
> > It's hard to conclude anything based on this information alone. If
> > it's truly expensive to merge them, then it warrants the additional
> > complexity. But I don't understand the desire to share a design for
> > two systems with sufficiently different constraints.
> > 
> > > Upsides?
> > > 
> > > 1) slab utilization might increase a little bit (but I doubt it will have
> > >    a huge effect, because both merging sets should be relatively big and well
> > >    utilized)
> > 
> > Right.
> > 
> > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > >    but there isn't so much code left anyway.
> > 
> > There is a lot of complexity associated with the cache cloning that
> > isn't the lines of code, but the lifetime and synchronization rules.
> 
> Quite opposite: the patchset removes all the complexity (or 90% of it),
> because it makes the kmem_cache lifetime independent from any cgroup stuff.
> 
> Kmem_caches are created on demand on the first request (most likely during
> the system start-up), and destroyed together with their root counterparts
> (most likely never or on rmmod). First request means globally first request,
> not a first request from a given memcg.
> 
> Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> after creation just matches the lifetime of the root kmem caches.
> 
> The only reason to keep the async creation is that some kmem_caches
> are created very early in the boot process, long before any cgroup
> stuff is initialized.

Yes, it's independent of the obj_cgroup and memcg, and yes it's
simpler after your patches. But I'm not talking about the delta, I'm
trying to understand the end result.

And the truth is there is a decent chunk of code and tentacles spread
throughout the slab/cgroup code to clone, destroy, and handle the
split caches, as well as the branches/indirections on every cgrouped
slab allocation.

Yet there is no good explanation for why things are done this way
anywhere in the changelog, the cover letter, or the code. And it's
hard to get a satisfying answer even to direct questions about it.

Forget about how anything was before your patches and put yourself
into the shoes of somebody who comes at the new code without any
previous knowledge. "It was even worse before" just isn't a satisfying
answer.

> > And these two things are the primary aspects that make my head hurt
> > trying to review this patch series.
> > 
> > > So IMO it's an interesting direction to explore, but not something
> > > that necessarily has to be done in the context of this patchset.
> > 
> > I disagree. Instead of replacing the old coherent model and its
> > complexities with a new coherent one, you are mixing the two. And I
> > can barely understand the end result.
> > 
> > Dynamically cloning entire slab caches for the sole purpose of telling
> > whether the pages have an obj_cgroup array or not is *completely
> > insane*. If the controller had followed the obj_cgroup design from the
> > start, nobody would have ever thought about doing it like this.
> 
> It's just not true. The whole point of having root- and memcg sets is
> to be able to not look for a NULL pointer in the obj_cgroup vector on
> releasing of the root object. In other words, it allows to keep zero
> overhead for root allocations. IMHO it's an important thing, and calling
> it *completely insane* isn't the best way to communicate.

But you're trading it for the indirection of going through a separate
kmem_cache for every single cgroup-accounted allocation. Why is this a
preferable trade-off to make?

I'm asking basic questions about your design choices. It's not okay to
dismiss this with "it's an interesting direction to explore outside
the context this patchset".

> > From a maintainability POV, we cannot afford merging it in this form.
> 
> It sounds strange: the patchset eliminates 90% of the complexity,
> but it's unmergeable because there are 10% left.

No, it's unmergeable if you're unwilling to explain and document your
design choices when somebody who is taking the time and effort to look
at your patches doesn't understand why things are the way they are.

We are talking about 1500 lines of complicated core kernel code. They
*have* to make sense to people other than you if we want to have this
upstream.

> I agree that it's an arguable question if we can tolerate some
> additional overhead on root allocations to eliminate these additional
> 10%, but I really don't think it's so obvious that even discussing
> it is insane.

Well that's exactly my point.

> Btw, there is another good idea to explore (also suggested by Christopher
> Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> an extra allocation.

I agree with this idea, but I do think that's a bit more obviously in
optimization territory. The objcg is much larger than a pointer to it,
and it wouldn't significantly change the alloc/free sequence, right?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  2:47           ` Johannes Weiner
@ 2020-02-04  4:35             ` Roman Gushchin
  2020-02-04 18:41               ` Johannes Weiner
  0 siblings, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-02-04  4:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > creating a separate set for each memory cgroup.
> > > > > > 
> > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > 
> > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > explain why we still need two sets of kmem_caches?
> > > > > 
> > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > 
> > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > 
> > > > Well, arguably it can be done, but there are few drawbacks:
> > > > 
> > > > 1) On the release path you'll need to make some extra work even for
> > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > 
> > > > 2) There will be a memory overhead for root allocations
> > > >    (which might or might not be compensated by the increase
> > > >    of the slab utilization).
> > > 
> > > Those two are only true if there is a wild mix of root and cgroup
> > > allocations inside the same slab, and that doesn't really happen in
> > > practice. Either the machine is dedicated to one workload and cgroups
> > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > systems (like most distro systems now) that cgroup everything.
> > 
> > It's actually a questionable statement: we do skip allocations from certain
> > contexts, and we do merge slab caches.
> > 
> > Most likely it's true for certain slab_caches and not true for others.
> > Think of kmalloc-* caches.
> 
> With merging it's actually really hard to say how sparse or dense the
> resulting objcgroup arrays would be. It could change all the time too.

So here is some actual data from my dev machine. The first column is the number
of pages in the root cache, the second - in the corresponding memcg.

   ext4_groupinfo_4k          1          0
     rpc_inode_cache          1          0
        fuse_request         62          0
          fuse_inode          1       2732
  btrfs_delayed_node       1192          0
btrfs_ordered_extent        129          0
    btrfs_extent_map       8686          0
 btrfs_extent_buffer       2648          0
         btrfs_inode         12       6739
              PINGv6          1         11
               RAWv6          2          5
               UDPv6          1         34
       tw_sock_TCPv6        378          3
  request_sock_TCPv6         24          0
               TCPv6         46         74
  mqueue_inode_cache          1          0
 jbd2_journal_handle          2          0
   jbd2_journal_head          2          0
 jbd2_revoke_table_s          1          0
    ext4_inode_cache          1          3
ext4_allocation_context          1          0
         ext4_io_end          1          0
  ext4_extent_status          5          0
             mbcache          1          0
      dnotify_struct          1          0
  posix_timers_cache         24          0
      xfrm_dst_cache        202          0
                 RAW          3         12
                 UDP          2         24
         tw_sock_TCP         25          0
    request_sock_TCP         24          0
                 TCP          7         24
hugetlbfs_inode_cache          2          0
               dquot          2          0
       eventpoll_pwq          1        119
           dax_cache          1          0
       request_queue          9          0
          blkdev_ioc        241          0
          biovec-max        112          0
          biovec-128          2          0
           biovec-64          6          0
  khugepaged_mm_slot        248          0
 dmaengine-unmap-256          1          0
 dmaengine-unmap-128          1          0
  dmaengine-unmap-16         39          0
    sock_inode_cache          9        219
    skbuff_ext_cache        249          0
 skbuff_fclone_cache         83          0
   skbuff_head_cache        138        141
     file_lock_cache         24          0
       net_namespace          1          5
   shmem_inode_cache         14         56
     task_delay_info         23        165
           taskstats         24          0
      proc_dir_entry         24          0
          pde_opener         16         24
    proc_inode_cache         24       1103
          bdev_cache          4         20
   kernfs_node_cache       1405          0
           mnt_cache         54          0
                filp         53        460
         inode_cache        488       2287
              dentry        367      10576
         names_cache         24          0
        ebitmap_node          2          0
     avc_xperms_data        256          0
      lsm_file_cache         92          0
         buffer_head         24          9
       uts_namespace          1          3
      vm_area_struct         48        810
           mm_struct         19         29
         files_cache         14         26
        signal_cache         28        143
       sighand_cache         45         47
         task_struct         77        430
            cred_jar         29        424
      anon_vma_chain         39        492
            anon_vma         28        467
                 pid         30        369
        Acpi-Operand         56          0
          Acpi-Parse       5587          0
          Acpi-State       4137          0
      Acpi-Namespace          8          0
         numa_policy        137          0
  ftrace_event_field         68          0
      pool_workqueue         25          0
     radix_tree_node       1694       7776
          task_group         21          0
           vmap_area        477          0
     kmalloc-rcl-512        473          0
     kmalloc-rcl-256        605          0
     kmalloc-rcl-192         43         16
     kmalloc-rcl-128          1         47
      kmalloc-rcl-96          3        229
      kmalloc-rcl-64          6        611
          kmalloc-8k         48         24
          kmalloc-4k        372         59
          kmalloc-2k        132         50
          kmalloc-1k        251         82
         kmalloc-512        360        150
         kmalloc-256        237          0
         kmalloc-192        298         24
         kmalloc-128        203         24
          kmalloc-96        112         24
          kmalloc-64        796         24
          kmalloc-32       1188         26
          kmalloc-16        555         25
           kmalloc-8         42         24
     kmem_cache_node         20          0
          kmem_cache         24          0

> 
> > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > In other words, memcg allocations will fragment root slab pages.
> 
> I understand the first part of this paragraph, but not the second. The
> objcgroup vectors will be freed when the slab pages get freed. But the
> partially filled slab pages can be reused by any types of allocations,
> surely? How would this cause the pages to fragment?

I mean the following: once you allocate a single accounted object
from the page, obj_cgroup vector is allocated and will be released only
with the slab page. We really really don't want to count how many accounted
objects are on the page and release obj_cgroup vector on reaching 0.
So even if all following allocations are root allocations, the overhead
will not go away with the uptime.

In other words, even a small percentage of accounted objects will
turn the whole cache into "accountable".

> 
> > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > >    I kinda like using the same scheme here and there.
> > > 
> > > It's hard to conclude anything based on this information alone. If
> > > it's truly expensive to merge them, then it warrants the additional
> > > complexity. But I don't understand the desire to share a design for
> > > two systems with sufficiently different constraints.
> > > 
> > > > Upsides?
> > > > 
> > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > >    a huge effect, because both merging sets should be relatively big and well
> > > >    utilized)
> > > 
> > > Right.
> > > 
> > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > >    but there isn't so much code left anyway.
> > > 
> > > There is a lot of complexity associated with the cache cloning that
> > > isn't the lines of code, but the lifetime and synchronization rules.
> > 
> > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > 
> > Kmem_caches are created on demand on the first request (most likely during
> > the system start-up), and destroyed together with their root counterparts
> > (most likely never or on rmmod). First request means globally first request,
> > not a first request from a given memcg.
> > 
> > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > after creation just matches the lifetime of the root kmem caches.
> > 
> > The only reason to keep the async creation is that some kmem_caches
> > are created very early in the boot process, long before any cgroup
> > stuff is initialized.
> 
> Yes, it's independent of the obj_cgroup and memcg, and yes it's
> simpler after your patches. But I'm not talking about the delta, I'm
> trying to understand the end result.
> 
> And the truth is there is a decent chunk of code and tentacles spread
> throughout the slab/cgroup code to clone, destroy, and handle the
> split caches, as well as the branches/indirections on every cgrouped
> slab allocation.

Did you see the final code? It's fairly simple and there is really not
much of complexity left. If you don't think so, let's go into details,
because otherwise it's hard to say anything.

With a such change which basically removes the current implementation
and replaces it with a new one, it's hard to keep the balance between
making commits self-contained and small, but also showing the whole picture.
I'm fully open to questions and generally want to make it simpler.

I've tried to separate some parts and get them merged before the main
thing, but they haven't been merged yet, so I have to include them
to keep the thing building.

Will a more-detailed design in the cover help?
Will writing a design doc to put into Documentation/ help?
Is it better to rearrange patches in a way to eliminate the current
implementation first and build from scratch?

> 
> Yet there is no good explanation for why things are done this way
> anywhere in the changelog, the cover letter, or the code. And it's
> hard to get a satisfying answer even to direct questions about it.

I do not agree. I try to answer all questions. But I also expect
that my arguments will be listened.
(I didn't answer questions re lifetime of obj_cgroup, but only
because I need some more time to think. If it wasn't clear, I'm sorry.).

> 
> Forget about how anything was before your patches and put yourself
> into the shoes of somebody who comes at the new code without any
> previous knowledge. "It was even worse before" just isn't a satisfying
> answer.

Absolutely agree.

But at the same time "now it's better than before" sounds like a good
validation for a change. The code is never perfect.

But, please, let's don't go into long discussions here and save some time.

> 
> > > And these two things are the primary aspects that make my head hurt
> > > trying to review this patch series.
> > > 
> > > > So IMO it's an interesting direction to explore, but not something
> > > > that necessarily has to be done in the context of this patchset.
> > > 
> > > I disagree. Instead of replacing the old coherent model and its
> > > complexities with a new coherent one, you are mixing the two. And I
> > > can barely understand the end result.
> > > 
> > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > whether the pages have an obj_cgroup array or not is *completely
> > > insane*. If the controller had followed the obj_cgroup design from the
> > > start, nobody would have ever thought about doing it like this.
> > 
> > It's just not true. The whole point of having root- and memcg sets is
> > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > releasing of the root object. In other words, it allows to keep zero
> > overhead for root allocations. IMHO it's an important thing, and calling
> > it *completely insane* isn't the best way to communicate.
> 
> But you're trading it for the indirection of going through a separate
> kmem_cache for every single cgroup-accounted allocation. Why is this a
> preferable trade-off to make?

Because it allows to keep zero memory and cpu overhead for root allocations.
I've no data showing that this overhead is small and acceptable in all cases.
I think keeping zero overhead for root allocations is more important
than having a single set of kmem caches.

> 
> I'm asking basic questions about your design choices. It's not okay to
> dismiss this with "it's an interesting direction to explore outside
> the context this patchset".

I'm not dismissing any questions.
There is a difference between a question and a must-to-follow suggestion,
which has known and ignored trade-offs.

> 
> > > From a maintainability POV, we cannot afford merging it in this form.
> > 
> > It sounds strange: the patchset eliminates 90% of the complexity,
> > but it's unmergeable because there are 10% left.
> 
> No, it's unmergeable if you're unwilling to explain and document your
> design choices when somebody who is taking the time and effort to look
> at your patches doesn't understand why things are the way they are.

I'm not unwilling to explain. Otherwise I just wouldn't post it upstream,
right? And I assume you're spending your time reviewing it not with the goal
to keep the current code intact.

Please, let's keep separate things which are hard to understand and
require an explanation and things which you think are better done differently.

Both are valid and appreciated comments, but mixing them isn't productive.

> 
> We are talking about 1500 lines of complicated core kernel code. They
> *have* to make sense to people other than you if we want to have this
> upstream.

Right.

> 
> > I agree that it's an arguable question if we can tolerate some
> > additional overhead on root allocations to eliminate these additional
> > 10%, but I really don't think it's so obvious that even discussing
> > it is insane.
> 
> Well that's exactly my point.

Ok, what's the acceptable performance penalty?
Is adding 20% on free path is acceptable, for example?
Or adding 3% of slab memory?

> 
> > Btw, there is another good idea to explore (also suggested by Christopher
> > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > an extra allocation.
> 
> I agree with this idea, but I do think that's a bit more obviously in
> optimization territory. The objcg is much larger than a pointer to it,
> and it wouldn't significantly change the alloc/free sequence, right?

So the idea is that putting the obj_cgroup pointer nearby will eliminate
some cache misses. But then it's preferable to have two sets, because otherwise
there is a memory overhead from allocating an extra space for the objcg pointer.


Stepping a bit back: the new scheme (new slab controller) adds some cpu operations
on the allocation and release paths. It's unavoidable: more precise
accounting requires more CPU. But IMO it's worth it because it leads
to significant memory savings and reduced memory fragmentation.
Also it reduces the code complexity (which is a bonus but not the primary goal).

I haven't seen so far any workloads where the difference was noticeable,
but it doesn't mean they do not exist. That's why I'm very concerned about
any suggestions which might even in theory increase the cpu overhead.
Keeping it at zero level for root allocations allows do exclude
something from the accounting if the performance penalty is not tolerable.

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04  4:35             ` Roman Gushchin
@ 2020-02-04 18:41               ` Johannes Weiner
  2020-02-05 15:58                 ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2020-02-04 18:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > creating a separate set for each memory cgroup.
> > > > > > > 
> > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > 
> > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > explain why we still need two sets of kmem_caches?
> > > > > > 
> > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > 
> > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > 
> > > > > Well, arguably it can be done, but there are few drawbacks:
> > > > > 
> > > > > 1) On the release path you'll need to make some extra work even for
> > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > 
> > > > > 2) There will be a memory overhead for root allocations
> > > > >    (which might or might not be compensated by the increase
> > > > >    of the slab utilization).
> > > > 
> > > > Those two are only true if there is a wild mix of root and cgroup
> > > > allocations inside the same slab, and that doesn't really happen in
> > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > systems (like most distro systems now) that cgroup everything.
> > > 
> > > It's actually a questionable statement: we do skip allocations from certain
> > > contexts, and we do merge slab caches.
> > > 
> > > Most likely it's true for certain slab_caches and not true for others.
> > > Think of kmalloc-* caches.
> > 
> > With merging it's actually really hard to say how sparse or dense the
> > resulting objcgroup arrays would be. It could change all the time too.
> 
> So here is some actual data from my dev machine. The first column is the number
> of pages in the root cache, the second - in the corresponding memcg.
> 
>    ext4_groupinfo_4k          1          0
>      rpc_inode_cache          1          0
>         fuse_request         62          0
>           fuse_inode          1       2732
>   btrfs_delayed_node       1192          0
> btrfs_ordered_extent        129          0
>     btrfs_extent_map       8686          0
>  btrfs_extent_buffer       2648          0
>          btrfs_inode         12       6739
>               PINGv6          1         11
>                RAWv6          2          5
>                UDPv6          1         34
>        tw_sock_TCPv6        378          3
>   request_sock_TCPv6         24          0
>                TCPv6         46         74
>   mqueue_inode_cache          1          0
>  jbd2_journal_handle          2          0
>    jbd2_journal_head          2          0
>  jbd2_revoke_table_s          1          0
>     ext4_inode_cache          1          3
> ext4_allocation_context          1          0
>          ext4_io_end          1          0
>   ext4_extent_status          5          0
>              mbcache          1          0
>       dnotify_struct          1          0
>   posix_timers_cache         24          0
>       xfrm_dst_cache        202          0
>                  RAW          3         12
>                  UDP          2         24
>          tw_sock_TCP         25          0
>     request_sock_TCP         24          0
>                  TCP          7         24
> hugetlbfs_inode_cache          2          0
>                dquot          2          0
>        eventpoll_pwq          1        119
>            dax_cache          1          0
>        request_queue          9          0
>           blkdev_ioc        241          0
>           biovec-max        112          0
>           biovec-128          2          0
>            biovec-64          6          0
>   khugepaged_mm_slot        248          0
>  dmaengine-unmap-256          1          0
>  dmaengine-unmap-128          1          0
>   dmaengine-unmap-16         39          0
>     sock_inode_cache          9        219
>     skbuff_ext_cache        249          0
>  skbuff_fclone_cache         83          0
>    skbuff_head_cache        138        141
>      file_lock_cache         24          0
>        net_namespace          1          5
>    shmem_inode_cache         14         56
>      task_delay_info         23        165
>            taskstats         24          0
>       proc_dir_entry         24          0
>           pde_opener         16         24
>     proc_inode_cache         24       1103
>           bdev_cache          4         20
>    kernfs_node_cache       1405          0
>            mnt_cache         54          0
>                 filp         53        460
>          inode_cache        488       2287
>               dentry        367      10576
>          names_cache         24          0
>         ebitmap_node          2          0
>      avc_xperms_data        256          0
>       lsm_file_cache         92          0
>          buffer_head         24          9
>        uts_namespace          1          3
>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369
>         Acpi-Operand         56          0
>           Acpi-Parse       5587          0
>           Acpi-State       4137          0
>       Acpi-Namespace          8          0
>          numa_policy        137          0
>   ftrace_event_field         68          0
>       pool_workqueue         25          0
>      radix_tree_node       1694       7776
>           task_group         21          0
>            vmap_area        477          0
>      kmalloc-rcl-512        473          0
>      kmalloc-rcl-256        605          0
>      kmalloc-rcl-192         43         16
>      kmalloc-rcl-128          1         47
>       kmalloc-rcl-96          3        229
>       kmalloc-rcl-64          6        611
>           kmalloc-8k         48         24
>           kmalloc-4k        372         59
>           kmalloc-2k        132         50
>           kmalloc-1k        251         82
>          kmalloc-512        360        150
>          kmalloc-256        237          0
>          kmalloc-192        298         24
>          kmalloc-128        203         24
>           kmalloc-96        112         24
>           kmalloc-64        796         24
>           kmalloc-32       1188         26
>           kmalloc-16        555         25
>            kmalloc-8         42         24
>      kmem_cache_node         20          0
>           kmem_cache         24          0

That's interesting, thanks. It does look fairly bimodal, except in
some smaller caches. Which does make sense when you think about it: we
focus on accounting consumers that are driven by userspace activity
and big enough to actually matter in terms of cgroup footprint.

> > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > In other words, memcg allocations will fragment root slab pages.
> > 
> > I understand the first part of this paragraph, but not the second. The
> > objcgroup vectors will be freed when the slab pages get freed. But the
> > partially filled slab pages can be reused by any types of allocations,
> > surely? How would this cause the pages to fragment?
> 
> I mean the following: once you allocate a single accounted object
> from the page, obj_cgroup vector is allocated and will be released only
> with the slab page. We really really don't want to count how many accounted
> objects are on the page and release obj_cgroup vector on reaching 0.
> So even if all following allocations are root allocations, the overhead
> will not go away with the uptime.
> 
> In other words, even a small percentage of accounted objects will
> turn the whole cache into "accountable".

Correct. The worst case is where we have a large cache that has N
objects per slab, but only ~1/N objects are accounted to a cgroup.

The question is whether this is common or even realistic. When would a
cache be big, but only a small subset of its allocations would be
attributable to specific cgroups?

On the less extreme overlapping cases, yeah there are fragmented
obj_cgroup arrays, but there is also better slab packing. One is an
array of pointers, the other is an array of larger objects. It would
seem slab fragmentation has the potential to waste much more memory?

> > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > >    I kinda like using the same scheme here and there.
> > > > 
> > > > It's hard to conclude anything based on this information alone. If
> > > > it's truly expensive to merge them, then it warrants the additional
> > > > complexity. But I don't understand the desire to share a design for
> > > > two systems with sufficiently different constraints.
> > > > 
> > > > > Upsides?
> > > > > 
> > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > >    utilized)
> > > > 
> > > > Right.
> > > > 
> > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > >    but there isn't so much code left anyway.
> > > > 
> > > > There is a lot of complexity associated with the cache cloning that
> > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > 
> > > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > 
> > > Kmem_caches are created on demand on the first request (most likely during
> > > the system start-up), and destroyed together with their root counterparts
> > > (most likely never or on rmmod). First request means globally first request,
> > > not a first request from a given memcg.
> > > 
> > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > after creation just matches the lifetime of the root kmem caches.
> > > 
> > > The only reason to keep the async creation is that some kmem_caches
> > > are created very early in the boot process, long before any cgroup
> > > stuff is initialized.
> > 
> > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > simpler after your patches. But I'm not talking about the delta, I'm
> > trying to understand the end result.
> > 
> > And the truth is there is a decent chunk of code and tentacles spread
> > throughout the slab/cgroup code to clone, destroy, and handle the
> > split caches, as well as the branches/indirections on every cgrouped
> > slab allocation.
> 
> Did you see the final code? It's fairly simple and there is really not
> much of complexity left. If you don't think so, let's go into details,
> because otherwise it's hard to say anything.

I have the patches applied to a local tree and am looking at the final
code. But I can only repeat that "it's not too bad" simply isn't a
good explanation for why the code is the way it is.

> With a such change which basically removes the current implementation
> and replaces it with a new one, it's hard to keep the balance between
> making commits self-contained and small, but also showing the whole picture.
> I'm fully open to questions and generally want to make it simpler.
> 
> I've tried to separate some parts and get them merged before the main
> thing, but they haven't been merged yet, so I have to include them
> to keep the thing building.
> 
> Will a more-detailed design in the cover help?
> Will writing a design doc to put into Documentation/ help?
> Is it better to rearrange patches in a way to eliminate the current
> implementation first and build from scratch?

It would help to have changelogs that actually describe how the new
design is supposed to work, and why you made the decisions you made.

The changelog in this patch here sells the change as a reduction in
complexity, without explaining why it stopped where it stopped. So
naturally, if that's the declared goal, the first question is whether
we can make it simpler.

Both the cover letter and the changelogs should focus less on what was
there and how it was deleted, and more on how the code is supposed to
work after the patches. How the components were designed and how they
all work together.

As I said before, imagine somebody without any historical knowledge
reading the code. They should be able to find out why you chose to
have two sets of kmem caches. There is no explanation for it other
than "there used to be more, so we cut it down to two".

> > > > And these two things are the primary aspects that make my head hurt
> > > > trying to review this patch series.
> > > > 
> > > > > So IMO it's an interesting direction to explore, but not something
> > > > > that necessarily has to be done in the context of this patchset.
> > > > 
> > > > I disagree. Instead of replacing the old coherent model and its
> > > > complexities with a new coherent one, you are mixing the two. And I
> > > > can barely understand the end result.
> > > > 
> > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > whether the pages have an obj_cgroup array or not is *completely
> > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > start, nobody would have ever thought about doing it like this.
> > > 
> > > It's just not true. The whole point of having root- and memcg sets is
> > > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > > releasing of the root object. In other words, it allows to keep zero
> > > overhead for root allocations. IMHO it's an important thing, and calling
> > > it *completely insane* isn't the best way to communicate.
> > 
> > But you're trading it for the indirection of going through a separate
> > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > preferable trade-off to make?
> 
> Because it allows to keep zero memory and cpu overhead for root allocations.
> I've no data showing that this overhead is small and acceptable in all cases.
> I think keeping zero overhead for root allocations is more important
> than having a single set of kmem caches.

In the kmem cache breakdown you provided above, there are 35887 pages
allocated to root caches and 37300 pages allocated to cgroup caches.

Why are root allocations supposed to be more important? Aren't some of
the hottest allocations tracked by cgroups? Look at fork():

>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369

Hard to think of much hotter allocations. They all have to suffer the
additional branch and cache footprint of the auxiliary cgroup caches.

> > > I agree that it's an arguable question if we can tolerate some
> > > additional overhead on root allocations to eliminate these additional
> > > 10%, but I really don't think it's so obvious that even discussing
> > > it is insane.
> > 
> > Well that's exactly my point.
> 
> Ok, what's the acceptable performance penalty?
> Is adding 20% on free path is acceptable, for example?
> Or adding 3% of slab memory?

I find offhand replies like these very jarring.

There is a legitimate design question: Why are you using a separate
set of kmem caches for the cgroup allocations, citing the additional
complexity over having one set? And your reply was mostly handwaving.

So: what's the overhead you're saving by having two sets? What is this
additional stuff buying us?

Pretend the split-cache infra hadn't been there. Would you have added
it? If so, based on what data? Now obviously, you didn't write it - it
was there because that's the way the per-cgroup accounting was done
previously. But you did choose to keep it. And it's a fair question
what (quantifiable) role it plays in your new way of doing things.

> > > Btw, there is another good idea to explore (also suggested by Christopher
> > > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > > an extra allocation.
> > 
> > I agree with this idea, but I do think that's a bit more obviously in
> > optimization territory. The objcg is much larger than a pointer to it,
> > and it wouldn't significantly change the alloc/free sequence, right?
> 
> So the idea is that putting the obj_cgroup pointer nearby will eliminate
> some cache misses. But then it's preferable to have two sets, because otherwise
> there is a memory overhead from allocating an extra space for the objcg pointer.

This trade-off is based on two assumptions:

1) Unaccounted allocations are more performance sensitive than
accounted allocations.

2) Fragmented obj_cgroup arrays waste more memory than fragmented
slabs.

You haven't sufficiently shown that either of those are true. (And I
suspect they are both false.)

So my stance is that until you make a more convincing argument for
this, a simpler concept and implementation, as well as balanced CPU
cost for unaccounted and accounted allocations, wins out.

> Stepping a bit back: the new scheme (new slab controller) adds some cpu operations
> on the allocation and release paths. It's unavoidable: more precise
> accounting requires more CPU. But IMO it's worth it because it leads
> to significant memory savings and reduced memory fragmentation.
> Also it reduces the code complexity (which is a bonus but not the primary goal).
> 
> I haven't seen so far any workloads where the difference was noticeable,
> but it doesn't mean they do not exist. That's why I'm very concerned about
> any suggestions which might even in theory increase the cpu overhead.
> Keeping it at zero level for root allocations allows do exclude
> something from the accounting if the performance penalty is not tolerable.

That sounds like a false trade-off to me. We account memory for
functional correctness - consumers that are big enough to
fundamentally alter the picture of cgroup memory footprints, allow
users to disturb other containers, or even cause host-level OOM
situations. Not based on whether they are cheap to track.

In fact, I would make the counter argument. It'd be pretty bad if
everytime we had to make an accounting change to maintain functional
correctness we'd have to worry about a CPU regression that exists in
part because we're trying to keep unaccounted allocations cheaper.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  2020-02-04 18:41               ` Johannes Weiner
@ 2020-02-05 15:58                 ` Roman Gushchin
  0 siblings, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-02-05 15:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Shakeel Butt,
	Vladimir Davydov, linux-kernel, kernel-team, Bharata B Rao,
	Yafang Shao

On Tue, Feb 04, 2020 at 01:41:59PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > > creating a separate set for each memory cgroup.
> > > > > > > > 
> > > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > > 
> > > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > > explain why we still need two sets of kmem_caches?
> > > > > > > 
> > > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > > 
> > > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > > 
> > > > > > Well, arguably it can be done, but there are few drawbacks:
> > > > > > 
> > > > > > 1) On the release path you'll need to make some extra work even for
> > > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > > 
> > > > > > 2) There will be a memory overhead for root allocations
> > > > > >    (which might or might not be compensated by the increase
> > > > > >    of the slab utilization).
> > > > > 
> > > > > Those two are only true if there is a wild mix of root and cgroup
> > > > > allocations inside the same slab, and that doesn't really happen in
> > > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > > systems (like most distro systems now) that cgroup everything.
> > > > 
> > > > It's actually a questionable statement: we do skip allocations from certain
> > > > contexts, and we do merge slab caches.
> > > > 
> > > > Most likely it's true for certain slab_caches and not true for others.
> > > > Think of kmalloc-* caches.
> > > 
> > > With merging it's actually really hard to say how sparse or dense the
> > > resulting objcgroup arrays would be. It could change all the time too.
> > 
> > So here is some actual data from my dev machine. The first column is the number
> > of pages in the root cache, the second - in the corresponding memcg.
> > 
> >    ext4_groupinfo_4k          1          0
> >      rpc_inode_cache          1          0
> >         fuse_request         62          0
> >           fuse_inode          1       2732
> >   btrfs_delayed_node       1192          0
> > btrfs_ordered_extent        129          0
> >     btrfs_extent_map       8686          0
> >  btrfs_extent_buffer       2648          0
> >          btrfs_inode         12       6739
> >               PINGv6          1         11
> >                RAWv6          2          5
> >                UDPv6          1         34
> >        tw_sock_TCPv6        378          3
> >   request_sock_TCPv6         24          0
> >                TCPv6         46         74
> >   mqueue_inode_cache          1          0
> >  jbd2_journal_handle          2          0
> >    jbd2_journal_head          2          0
> >  jbd2_revoke_table_s          1          0
> >     ext4_inode_cache          1          3
> > ext4_allocation_context          1          0
> >          ext4_io_end          1          0
> >   ext4_extent_status          5          0
> >              mbcache          1          0
> >       dnotify_struct          1          0
> >   posix_timers_cache         24          0
> >       xfrm_dst_cache        202          0
> >                  RAW          3         12
> >                  UDP          2         24
> >          tw_sock_TCP         25          0
> >     request_sock_TCP         24          0
> >                  TCP          7         24
> > hugetlbfs_inode_cache          2          0
> >                dquot          2          0
> >        eventpoll_pwq          1        119
> >            dax_cache          1          0
> >        request_queue          9          0
> >           blkdev_ioc        241          0
> >           biovec-max        112          0
> >           biovec-128          2          0
> >            biovec-64          6          0
> >   khugepaged_mm_slot        248          0
> >  dmaengine-unmap-256          1          0
> >  dmaengine-unmap-128          1          0
> >   dmaengine-unmap-16         39          0
> >     sock_inode_cache          9        219
> >     skbuff_ext_cache        249          0
> >  skbuff_fclone_cache         83          0
> >    skbuff_head_cache        138        141
> >      file_lock_cache         24          0
> >        net_namespace          1          5
> >    shmem_inode_cache         14         56
> >      task_delay_info         23        165
> >            taskstats         24          0
> >       proc_dir_entry         24          0
> >           pde_opener         16         24
> >     proc_inode_cache         24       1103
> >           bdev_cache          4         20
> >    kernfs_node_cache       1405          0
> >            mnt_cache         54          0
> >                 filp         53        460
> >          inode_cache        488       2287
> >               dentry        367      10576
> >          names_cache         24          0
> >         ebitmap_node          2          0
> >      avc_xperms_data        256          0
> >       lsm_file_cache         92          0
> >          buffer_head         24          9
> >        uts_namespace          1          3
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> >         Acpi-Operand         56          0
> >           Acpi-Parse       5587          0
> >           Acpi-State       4137          0
> >       Acpi-Namespace          8          0
> >          numa_policy        137          0
> >   ftrace_event_field         68          0
> >       pool_workqueue         25          0
> >      radix_tree_node       1694       7776
> >           task_group         21          0
> >            vmap_area        477          0
> >      kmalloc-rcl-512        473          0
> >      kmalloc-rcl-256        605          0
> >      kmalloc-rcl-192         43         16
> >      kmalloc-rcl-128          1         47
> >       kmalloc-rcl-96          3        229
> >       kmalloc-rcl-64          6        611
> >           kmalloc-8k         48         24
> >           kmalloc-4k        372         59
> >           kmalloc-2k        132         50
> >           kmalloc-1k        251         82
> >          kmalloc-512        360        150
> >          kmalloc-256        237          0
> >          kmalloc-192        298         24
> >          kmalloc-128        203         24
> >           kmalloc-96        112         24
> >           kmalloc-64        796         24
> >           kmalloc-32       1188         26
> >           kmalloc-16        555         25
> >            kmalloc-8         42         24
> >      kmem_cache_node         20          0
> >           kmem_cache         24          0
> 
> That's interesting, thanks. It does look fairly bimodal, except in
> some smaller caches. Which does make sense when you think about it: we
> focus on accounting consumers that are driven by userspace activity
> and big enough to actually matter in terms of cgroup footprint.
> 
> > > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > > In other words, memcg allocations will fragment root slab pages.
> > > 
> > > I understand the first part of this paragraph, but not the second. The
> > > objcgroup vectors will be freed when the slab pages get freed. But the
> > > partially filled slab pages can be reused by any types of allocations,
> > > surely? How would this cause the pages to fragment?
> > 
> > I mean the following: once you allocate a single accounted object
> > from the page, obj_cgroup vector is allocated and will be released only
> > with the slab page. We really really don't want to count how many accounted
> > objects are on the page and release obj_cgroup vector on reaching 0.
> > So even if all following allocations are root allocations, the overhead
> > will not go away with the uptime.
> > 
> > In other words, even a small percentage of accounted objects will
> > turn the whole cache into "accountable".
> 
> Correct. The worst case is where we have a large cache that has N
> objects per slab, but only ~1/N objects are accounted to a cgroup.
> 
> The question is whether this is common or even realistic. When would a
> cache be big, but only a small subset of its allocations would be
> attributable to specific cgroups?
> 
> On the less extreme overlapping cases, yeah there are fragmented
> obj_cgroup arrays, but there is also better slab packing. One is an
> array of pointers, the other is an array of larger objects. It would
> seem slab fragmentation has the potential to waste much more memory?
> 
> > > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > > >    I kinda like using the same scheme here and there.
> > > > > 
> > > > > It's hard to conclude anything based on this information alone. If
> > > > > it's truly expensive to merge them, then it warrants the additional
> > > > > complexity. But I don't understand the desire to share a design for
> > > > > two systems with sufficiently different constraints.
> > > > > 
> > > > > > Upsides?
> > > > > > 
> > > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > > >    utilized)
> > > > > 
> > > > > Right.
> > > > > 
> > > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > > >    but there isn't so much code left anyway.
> > > > > 
> > > > > There is a lot of complexity associated with the cache cloning that
> > > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > > 
> > > > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > > 
> > > > Kmem_caches are created on demand on the first request (most likely during
> > > > the system start-up), and destroyed together with their root counterparts
> > > > (most likely never or on rmmod). First request means globally first request,
> > > > not a first request from a given memcg.
> > > > 
> > > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > > after creation just matches the lifetime of the root kmem caches.
> > > > 
> > > > The only reason to keep the async creation is that some kmem_caches
> > > > are created very early in the boot process, long before any cgroup
> > > > stuff is initialized.
> > > 
> > > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > > simpler after your patches. But I'm not talking about the delta, I'm
> > > trying to understand the end result.
> > > 
> > > And the truth is there is a decent chunk of code and tentacles spread
> > > throughout the slab/cgroup code to clone, destroy, and handle the
> > > split caches, as well as the branches/indirections on every cgrouped
> > > slab allocation.
> > 
> > Did you see the final code? It's fairly simple and there is really not
> > much of complexity left. If you don't think so, let's go into details,
> > because otherwise it's hard to say anything.
> 
> I have the patches applied to a local tree and am looking at the final
> code. But I can only repeat that "it's not too bad" simply isn't a
> good explanation for why the code is the way it is.
> 
> > With a such change which basically removes the current implementation
> > and replaces it with a new one, it's hard to keep the balance between
> > making commits self-contained and small, but also showing the whole picture.
> > I'm fully open to questions and generally want to make it simpler.
> > 
> > I've tried to separate some parts and get them merged before the main
> > thing, but they haven't been merged yet, so I have to include them
> > to keep the thing building.
> > 
> > Will a more-detailed design in the cover help?
> > Will writing a design doc to put into Documentation/ help?
> > Is it better to rearrange patches in a way to eliminate the current
> > implementation first and build from scratch?
> 
> It would help to have changelogs that actually describe how the new
> design is supposed to work, and why you made the decisions you made.
> 
> The changelog in this patch here sells the change as a reduction in
> complexity, without explaining why it stopped where it stopped. So
> naturally, if that's the declared goal, the first question is whether
> we can make it simpler.
> 
> Both the cover letter and the changelogs should focus less on what was
> there and how it was deleted, and more on how the code is supposed to
> work after the patches. How the components were designed and how they
> all work together.
> 
> As I said before, imagine somebody without any historical knowledge
> reading the code. They should be able to find out why you chose to
> have two sets of kmem caches. There is no explanation for it other
> than "there used to be more, so we cut it down to two".
> 
> > > > > And these two things are the primary aspects that make my head hurt
> > > > > trying to review this patch series.
> > > > > 
> > > > > > So IMO it's an interesting direction to explore, but not something
> > > > > > that necessarily has to be done in the context of this patchset.
> > > > > 
> > > > > I disagree. Instead of replacing the old coherent model and its
> > > > > complexities with a new coherent one, you are mixing the two. And I
> > > > > can barely understand the end result.
> > > > > 
> > > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > > whether the pages have an obj_cgroup array or not is *completely
> > > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > > start, nobody would have ever thought about doing it like this.
> > > > 
> > > > It's just not true. The whole point of having root- and memcg sets is
> > > > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > > > releasing of the root object. In other words, it allows to keep zero
> > > > overhead for root allocations. IMHO it's an important thing, and calling
> > > > it *completely insane* isn't the best way to communicate.
> > > 
> > > But you're trading it for the indirection of going through a separate
> > > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > > preferable trade-off to make?
> > 
> > Because it allows to keep zero memory and cpu overhead for root allocations.
> > I've no data showing that this overhead is small and acceptable in all cases.
> > I think keeping zero overhead for root allocations is more important
> > than having a single set of kmem caches.
> 
> In the kmem cache breakdown you provided above, there are 35887 pages
> allocated to root caches and 37300 pages allocated to cgroup caches.
> 
> Why are root allocations supposed to be more important? Aren't some of
> the hottest allocations tracked by cgroups? Look at fork():
> 
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> 
> Hard to think of much hotter allocations. They all have to suffer the
> additional branch and cache footprint of the auxiliary cgroup caches.
> 
> > > > I agree that it's an arguable question if we can tolerate some
> > > > additional overhead on root allocations to eliminate these additional
> > > > 10%, but I really don't think it's so obvious that even discussing
> > > > it is insane.
> > > 
> > > Well that's exactly my point.
> > 
> > Ok, what's the acceptable performance penalty?
> > Is adding 20% on free path is acceptable, for example?
> > Or adding 3% of slab memory?
> 
> I find offhand replies like these very jarring.
> 
> There is a legitimate design question: Why are you using a separate
> set of kmem caches for the cgroup allocations, citing the additional
> complexity over having one set? And your reply was mostly handwaving.

Johannes,

I posted patches and numbers that shows that the patchset improves
a fundamental kernel characteristic (slab utilization) by a meaningful margin.
It has been confirmed by others, who kindly tested it on their machines.

Surprisingly, during this and previous review sessions, I didn't hear
a single good word from you, but a constant stream of blame: I do not answer
questions, I do not write perfect code, I fail to provide satisfying
answers, I'm waving hands, saying insane things etc etc.
Any minimal disagreement with you and you're basically raising the tone.

I find this style of discussions irritating and non-productive.
So I'm taking a break and start working on the next version.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-01-31 22:24     ` Roman Gushchin
@ 2020-02-12  5:21       ` Bharata B Rao
  2020-02-12 20:42         ` Roman Gushchin
  0 siblings, 1 reply; 56+ messages in thread
From: Bharata B Rao @ 2020-02-12  5:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Fri, Jan 31, 2020 at 10:24:58PM +0000, Roman Gushchin wrote:
> On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> 
> Btw, I've checked that the change like you've done above fixes the problem.
> The script works for me both on current upstream and new_slab.2 branch.
> 
> Are you sure that in your case there is some kernel memory charged to that
> cgroup? Please note, that in the current implementation kmem_caches are created
> on demand, so the accounting is effectively enabled with some delay.

I do see kmem getting charged.

# cat /sys/fs/cgroup/memory/1/memory.kmem.usage_in_bytes /sys/fs/cgroup/memory/1/memory.usage_in_bytes
182910976
4515627008

> Below is an updated version of the patch to use:

I see the below failure with this updated version:

# ./tools/cgroup/slabinfo-new.py /sys/fs/cgroup/memory/1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
Traceback (most recent call last):
  File "/usr/local/bin/drgn", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
    runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./tools/cgroup/slabinfo-new.py", line 158, in <module>
    main()
  File "./tools/cgroup/slabinfo-new.py", line 153, in main
    memcg.kmem_caches.address_of_(),
AttributeError: 'struct mem_cgroup' has no member 'kmem_caches'

> +
> +def main():
> +    parser = argparse.ArgumentParser(description=DESC,
> +                                     formatter_class=
> +                                     argparse.RawTextHelpFormatter)
> +    parser.add_argument('cgroup', metavar='CGROUP',
> +                        help='Target memory cgroup')
> +    args = parser.parse_args()
> +
> +    try:
> +        cgroup_id = stat(args.cgroup).st_ino
> +        find_memcg_ids()
> +        memcg = MEMCGS[cgroup_id]
> +    except KeyError:
> +        err('Can\'t find the memory cgroup')
> +
> +    cfg = detect_kernel_config()
> +
> +    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
> +          ' : tunables <limit> <batchcount> <sharedfactor>'
> +          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
> +
> +    for s in list_for_each_entry('struct kmem_cache',
> +                                 memcg.kmem_caches.address_of_(),
> +                                 'memcg_params.kmem_caches_node'):

Are you sure this is the right version? In the previous version
you had the if-else loop that handled shared_slab_pages and old
scheme separately.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
  2020-02-12  5:21       ` Bharata B Rao
@ 2020-02-12 20:42         ` Roman Gushchin
  0 siblings, 0 replies; 56+ messages in thread
From: Roman Gushchin @ 2020-02-12 20:42 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao

On Wed, Feb 12, 2020 at 10:51:24AM +0530, Bharata B Rao wrote:
> On Fri, Jan 31, 2020 at 10:24:58PM +0000, Roman Gushchin wrote:
> > On Thu, Jan 30, 2020 at 07:47:29AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:52AM -0800, Roman Gushchin wrote:
> > 
> > Btw, I've checked that the change like you've done above fixes the problem.
> > The script works for me both on current upstream and new_slab.2 branch.
> > 
> > Are you sure that in your case there is some kernel memory charged to that
> > cgroup? Please note, that in the current implementation kmem_caches are created
> > on demand, so the accounting is effectively enabled with some delay.
> 
> I do see kmem getting charged.
> 
> # cat /sys/fs/cgroup/memory/1/memory.kmem.usage_in_bytes /sys/fs/cgroup/memory/1/memory.usage_in_bytes
> 182910976
> 4515627008

Great.

> 
> > Below is an updated version of the patch to use:
> 
> I see the below failure with this updated version:

Are you sure that drgn is picking right symbols?
I had a similar transient issue during my work, when drgn was actually
using symbols from a different kernel.

> 
> # ./tools/cgroup/slabinfo-new.py /sys/fs/cgroup/memory/1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> Traceback (most recent call last):
>   File "/usr/local/bin/drgn", line 11, in <module>
>     sys.exit(main())
>   File "/usr/local/lib/python3.6/dist-packages/drgn/internal/cli.py", line 127, in main
>     runpy.run_path(args.script[0], init_globals=init_globals, run_name="__main__")
>   File "/usr/lib/python3.6/runpy.py", line 263, in run_path
>     pkg_name=pkg_name, script_name=fname)
>   File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
>     mod_name, mod_spec, pkg_name, script_name)
>   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "./tools/cgroup/slabinfo-new.py", line 158, in <module>
>     main()
>   File "./tools/cgroup/slabinfo-new.py", line 153, in main
>     memcg.kmem_caches.address_of_(),
> AttributeError: 'struct mem_cgroup' has no member 'kmem_caches'
> 
> > +
> > +def main():
> > +    parser = argparse.ArgumentParser(description=DES,C
> > +                                     formatter_class=
> > +                                     argparse.RawTextHelpFormatter)
> > +    parser.add_argument('cgroup', metavar='CGROUP',
> > +                        help='Target memory cgroup')
> > +    args = parser.parse_args()
> > +
> > +    try:
> > +        cgroup_id = stat(args.cgroup).st_ino
> > +        find_memcg_ids()
> > +        memcg = MEMCGS[cgroup_id]
> > +    except KeyError:
> > +        err('Can\'t find the memory cgroup')
> > +
> > +    cfg = detect_kernel_config()
> > +
> > +    print('# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>'
> > +          ' : tunables <limit> <batchcount> <sharedfactor>'
> > +          ' : slabdata <active_slabs> <num_slabs> <sharedavail>')
> > +
> > +    for s in list_for_each_entry('struct kmem_cache',
> > +                                 memcg.kmem_caches.address_of_(),
> > +                                 'memcg_params.kmem_caches_node'):
> 
> Are you sure this is the right version? In the previous version
> you had the if-else loop that handled shared_slab_pages and old
> scheme separately.

Which one you're refering to?

As in my tree there are two patches:
fa490da39afb tools/cgroup: add slabinfo.py tool
e3bee81aab44 tools/cgroup: make slabinfo.py compatible with new slab controller

The second one adds the if clause you're probably referring to.

Thanks!

Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-01-30  2:41   ` Roman Gushchin
@ 2020-08-12 23:16     ` Pavel Tatashin
  2020-08-12 23:18       ` Pavel Tatashin
  2020-08-13  0:04       ` Roman Gushchin
  0 siblings, 2 replies; 56+ messages in thread
From: Pavel Tatashin @ 2020-08-12 23:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

Guys,

There is a convoluted deadlock that I just root caused, and that is
fixed by this work (at least based on my code inspection it appears to
be fixed); but the deadlock exists in older and stable kernels, and I
am not sure whether to create a separate patch for it, or backport
this whole thing.

Thread #1: Hot-removes memory
device_offline
  memory_subsys_offline
    offline_pages
      __offline_pages
        mem_hotplug_lock <- write access
      waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
migrate it.

Thread #2: ccs killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: crashing userland program
do_coredump
  elf_core_dump
      get_dump_page() -> get page with pfn#9e5113, and increment refcnt
      dump_emit
        __kernel_write
          __vfs_write
            new_sync_write
              pipe_write
                pipe_wait   -> waits for Thread #4 systemd-coredump to
read the pipe

Thread #4: systemd-coredump
ksys_read
  vfs_read
    __vfs_read
      seq_read
        proc_single_show
          proc_cgroup_show
            cgroup_mutex -> waits from Thread #2 for this lock.

In Summary:
Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
waits for Thread#1 for mem_hotplug_lock rwlock.

This work appears to fix this deadlock because cgroup_mutex is not
called anymore before mem_hotplug_lock (unless I am missing it), as it
removes memcg_deactivate_kmem_caches.

Thank you,
Pasha

On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > The existing cgroup slab memory controller is based on the idea of
> > > replicating slab allocator internals for each memory cgroup.
> > > This approach promises a low memory overhead (one pointer per page),
> > > and isn't adding too much code on hot allocation and release paths.
> > > But is has a very serious flaw: it leads to a low slab utilization.
> > >
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > >
> > > The real reason why the existing design leads to a low slab utilization
> > > is simple: slab pages are used exclusively by one memory cgroup.
> > > If there are only few allocations of certain size made by a cgroup,
> > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > deleted, or the cgroup contains a single-threaded application which is
> > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > in all these cases the resulting slab utilization is very low.
> > > If kmem accounting is off, the kernel is able to use free space
> > > on slab pages for other allocations.
> > >
> > > Arguably it wasn't an issue back to days when the kmem controller was
> > > introduced and was an opt-in feature, which had to be turned on
> > > individually for each memory cgroup. But now it's turned on by default
> > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > create a large number of cgroups.
> > >
> > > This patchset provides a new implementation of the slab memory controller,
> > > which aims to reach a much better slab utilization by sharing slab pages
> > > between multiple memory cgroups. Below is the short description of the new
> > > design (more details in commit messages).
> > >
> > > Accounting is performed per-object instead of per-page. Slab-related
> > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > with rounding up and remembering leftovers.
> > >
> > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > working, instead of saving a pointer to the memory cgroup directly an
> > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > easily changed to the parent) with a built-in reference counter. This scheme
> > > allows to reparent all allocated objects without walking them over and
> > > changing memcg pointer to the parent.
> > >
> > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > two global sets are used: the root set for non-accounted and root-cgroup
> > > allocations and the second set for all other allocations. This allows to
> > > simplify the lifetime management of individual kmem_caches: they are
> > > destroyed with root counterparts. It allows to remove a good amount of code
> > > and make things generally simpler.
> > >
> > > The patchset* has been tested on a number of different workloads in our
> > > production. In all cases it saved significant amount of memory, measured
> > > from high hundreds of MBs to single GBs per host. On average, the size
> > > of slab memory has been reduced by 35-45%.
> >
> > Here are some numbers from multiple runs of sysbench and kernel compilation
> > with this patchset on a 10 core POWER8 host:
> >
> > ==========================================================================
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > of a mem cgroup (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    15859712        4456448         72
> > memory.usage_in_bytes         337510400       335806464       .5
> > Slab: (kB)                    814336          607296          25
> >
> > memory.kmem.usage_in_bytes    16187392        4653056         71
> > memory.usage_in_bytes         318832640       300154880       5
> > Slab: (kB)                    789888          559744          29
> > --------------------------------------------------------------------------
> >
> >
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > done from bash that is in a memory cgroup. (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    338493440       231931904       31
> > memory.usage_in_bytes         7368015872      6275923968      15
> > Slab: (kB)                    1139072         785408          31
> >
> > memory.kmem.usage_in_bytes    341835776       236453888       30
> > memory.usage_in_bytes         6540427264      6072893440      7
> > Slab: (kB)                    1074304         761280          29
> >
> > memory.kmem.usage_in_bytes    340525056       233570304       31
> > memory.usage_in_bytes         6406209536      6177357824      3
> > Slab: (kB)                    1244288         739712          40
> > --------------------------------------------------------------------------
> >
> > Slab consumption right after boot
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > Slab: (kB)                    821888          583424          29
> > ==========================================================================
> >
> > Summary:
> >
> > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > around 70% and 30% reduction consistently.
> >
> > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > kernel compilation.
> >
> > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > same is seen right after boot too.
>
> That's just perfect!
>
> memory.usage_in_bytes was most likely the same because the freed space
> was taken by pagecache.
>
> Thank you very much for testing!
>
> Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-12 23:16     ` Pavel Tatashin
@ 2020-08-12 23:18       ` Pavel Tatashin
  2020-08-13  0:04       ` Roman Gushchin
  1 sibling, 0 replies; 56+ messages in thread
From: Pavel Tatashin @ 2020-08-12 23:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

BTW, I replied to a wrong version of this work. I intended to reply to
version 7:
https://lore.kernel.org/lkml/20200623174037.3951353-1-guro@fb.com/

Nevertheless, the problem is the same.

Thank you,
Pasha

On Wed, Aug 12, 2020 at 7:16 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Guys,
>
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.
>
> Thread #1: Hot-removes memory
> device_offline
>   memory_subsys_offline
>     offline_pages
>       __offline_pages
>         mem_hotplug_lock <- write access
>       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
>
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: crashing userland program
> do_coredump
>   elf_core_dump
>       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>       dump_emit
>         __kernel_write
>           __vfs_write
>             new_sync_write
>               pipe_write
>                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> read the pipe
>
> Thread #4: systemd-coredump
> ksys_read
>   vfs_read
>     __vfs_read
>       seq_read
>         proc_single_show
>           proc_cgroup_show
>             cgroup_mutex -> waits from Thread #2 for this lock.
>
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
>
> This work appears to fix this deadlock because cgroup_mutex is not
> called anymore before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
>
> Thank you,
> Pasha
>
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > memory.usage_in_bytes         337510400       335806464       .5
> > > Slab: (kB)                    814336          607296          25
> > >
> > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > memory.usage_in_bytes         318832640       300154880       5
> > > Slab: (kB)                    789888          559744          29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > memory.usage_in_bytes         7368015872      6275923968      15
> > > Slab: (kB)                    1139072         785408          31
> > >
> > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > memory.usage_in_bytes         6540427264      6072893440      7
> > > Slab: (kB)                    1074304         761280          29
> > >
> > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > memory.usage_in_bytes         6406209536      6177357824      3
> > > Slab: (kB)                    1244288         739712          40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB)                    821888          583424          29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-12 23:16     ` Pavel Tatashin
  2020-08-12 23:18       ` Pavel Tatashin
@ 2020-08-13  0:04       ` Roman Gushchin
  2020-08-13  0:31         ` Pavel Tatashin
  1 sibling, 1 reply; 56+ messages in thread
From: Roman Gushchin @ 2020-08-13  0:04 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> Guys,
> 
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.

Hi Pavel,

wow, it's a quite complicated deadlock. Thank you for providing
a perfect analysis!

Unfortunately, backporting the whole new slab controller isn't an option:
it's way too big and invasive.
Do you already have a standalone fix?

Thanks!


> 
> Thread #1: Hot-removes memory
> device_offline
>   memory_subsys_offline
>     offline_pages
>       __offline_pages
>         mem_hotplug_lock <- write access
>       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> migrate it.
> 
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: crashing userland program
> do_coredump
>   elf_core_dump
>       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>       dump_emit
>         __kernel_write
>           __vfs_write
>             new_sync_write
>               pipe_write
>                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> read the pipe
> 
> Thread #4: systemd-coredump
> ksys_read
>   vfs_read
>     __vfs_read
>       seq_read
>         proc_single_show
>           proc_cgroup_show
>             cgroup_mutex -> waits from Thread #2 for this lock.

> 
> In Summary:
> Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> waits for Thread#1 for mem_hotplug_lock rwlock.
> 
> This work appears to fix this deadlock because cgroup_mutex is not
> called anymore before mem_hotplug_lock (unless I am missing it), as it
> removes memcg_deactivate_kmem_caches.
> 
> Thank you,
> Pasha
> 
> On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > The existing cgroup slab memory controller is based on the idea of
> > > > replicating slab allocator internals for each memory cgroup.
> > > > This approach promises a low memory overhead (one pointer per page),
> > > > and isn't adding too much code on hot allocation and release paths.
> > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > >
> > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > a number of machines running different production workloads. In most
> > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > it brings back 30-50% of slab memory. It means that the real price
> > > > of the existing slab memory controller is way bigger than a pointer
> > > > per page.
> > > >
> > > > The real reason why the existing design leads to a low slab utilization
> > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > If there are only few allocations of certain size made by a cgroup,
> > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > deleted, or the cgroup contains a single-threaded application which is
> > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > in all these cases the resulting slab utilization is very low.
> > > > If kmem accounting is off, the kernel is able to use free space
> > > > on slab pages for other allocations.
> > > >
> > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > introduced and was an opt-in feature, which had to be turned on
> > > > individually for each memory cgroup. But now it's turned on by default
> > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > create a large number of cgroups.
> > > >
> > > > This patchset provides a new implementation of the slab memory controller,
> > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > between multiple memory cgroups. Below is the short description of the new
> > > > design (more details in commit messages).
> > > >
> > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > with rounding up and remembering leftovers.
> > > >
> > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > allows to reparent all allocated objects without walking them over and
> > > > changing memcg pointer to the parent.
> > > >
> > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > allocations and the second set for all other allocations. This allows to
> > > > simplify the lifetime management of individual kmem_caches: they are
> > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > and make things generally simpler.
> > > >
> > > > The patchset* has been tested on a number of different workloads in our
> > > > production. In all cases it saved significant amount of memory, measured
> > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > of slab memory has been reduced by 35-45%.
> > >
> > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > with this patchset on a 10 core POWER8 host:
> > >
> > > ==========================================================================
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > of a mem cgroup (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > memory.usage_in_bytes         337510400       335806464       .5
> > > Slab: (kB)                    814336          607296          25
> > >
> > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > memory.usage_in_bytes         318832640       300154880       5
> > > Slab: (kB)                    789888          559744          29
> > > --------------------------------------------------------------------------
> > >
> > >
> > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > memory.usage_in_bytes         7368015872      6275923968      15
> > > Slab: (kB)                    1139072         785408          31
> > >
> > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > memory.usage_in_bytes         6540427264      6072893440      7
> > > Slab: (kB)                    1074304         761280          29
> > >
> > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > memory.usage_in_bytes         6406209536      6177357824      3
> > > Slab: (kB)                    1244288         739712          40
> > > --------------------------------------------------------------------------
> > >
> > > Slab consumption right after boot
> > > --------------------------------------------------------------------------
> > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > --------------------------------------------------------------------------
> > > Slab: (kB)                    821888          583424          29
> > > ==========================================================================
> > >
> > > Summary:
> > >
> > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > around 70% and 30% reduction consistently.
> > >
> > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > kernel compilation.
> > >
> > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > same is seen right after boot too.
> >
> > That's just perfect!
> >
> > memory.usage_in_bytes was most likely the same because the freed space
> > was taken by pagecache.
> >
> > Thank you very much for testing!
> >
> > Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-13  0:04       ` Roman Gushchin
@ 2020-08-13  0:31         ` Pavel Tatashin
  2020-08-28 16:47           ` Pavel Tatashin
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-08-13  0:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > Guys,
> >
> > There is a convoluted deadlock that I just root caused, and that is
> > fixed by this work (at least based on my code inspection it appears to
> > be fixed); but the deadlock exists in older and stable kernels, and I
> > am not sure whether to create a separate patch for it, or backport
> > this whole thing.
>

Hi Roman,

> Hi Pavel,
>
> wow, it's a quite complicated deadlock. Thank you for providing
> a perfect analysis!

Thank you, it indeed took me a while to fully grasp the deadlock.

>
> Unfortunately, backporting the whole new slab controller isn't an option:
> it's way too big and invasive.

This is what I thought as well, this is why I want to figure out what
is the best way forward.

> Do you already have a standalone fix?

Not yet, I do not have a standalone fix. I suspect the best fix would
be to address fix css_killed_work_fn() stack so we never have:
cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
the order would work. If you have suggestions since you worked on this
code recently, please let me know.

Thank you,
Pasha

>
> Thanks!
>
>
> >
> > Thread #1: Hot-removes memory
> > device_offline
> >   memory_subsys_offline
> >     offline_pages
> >       __offline_pages
> >         mem_hotplug_lock <- write access
> >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > migrate it.
> >
> > Thread #2: ccs killer kthread
> >    css_killed_work_fn
> >      cgroup_mutex  <- Grab this Mutex
> >      mem_cgroup_css_offline
> >        memcg_offline_kmem.part
> >           memcg_deactivate_kmem_caches
> >             get_online_mems
> >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: crashing userland program
> > do_coredump
> >   elf_core_dump
> >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> >       dump_emit
> >         __kernel_write
> >           __vfs_write
> >             new_sync_write
> >               pipe_write
> >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> > read the pipe
> >
> > Thread #4: systemd-coredump
> > ksys_read
> >   vfs_read
> >     __vfs_read
> >       seq_read
> >         proc_single_show
> >           proc_cgroup_show
> >             cgroup_mutex -> waits from Thread #2 for this lock.
>
> >
> > In Summary:
> > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > waits for Thread#1 for mem_hotplug_lock rwlock.
> >
> > This work appears to fix this deadlock because cgroup_mutex is not
> > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > removes memcg_deactivate_kmem_caches.
> >
> > Thank you,
> > Pasha
> >
> > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > replicating slab allocator internals for each memory cgroup.
> > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > >
> > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > a number of machines running different production workloads. In most
> > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > per page.
> > > > >
> > > > > The real reason why the existing design leads to a low slab utilization
> > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > in all these cases the resulting slab utilization is very low.
> > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > on slab pages for other allocations.
> > > > >
> > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > create a large number of cgroups.
> > > > >
> > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > design (more details in commit messages).
> > > > >
> > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > with rounding up and remembering leftovers.
> > > > >
> > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > allows to reparent all allocated objects without walking them over and
> > > > > changing memcg pointer to the parent.
> > > > >
> > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > allocations and the second set for all other allocations. This allows to
> > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > and make things generally simpler.
> > > > >
> > > > > The patchset* has been tested on a number of different workloads in our
> > > > > production. In all cases it saved significant amount of memory, measured
> > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > of slab memory has been reduced by 35-45%.
> > > >
> > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > with this patchset on a 10 core POWER8 host:
> > > >
> > > > ==========================================================================
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > of a mem cgroup (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > > memory.usage_in_bytes         337510400       335806464       .5
> > > > Slab: (kB)                    814336          607296          25
> > > >
> > > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > > memory.usage_in_bytes         318832640       300154880       5
> > > > Slab: (kB)                    789888          559744          29
> > > > --------------------------------------------------------------------------
> > > >
> > > >
> > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > > memory.usage_in_bytes         7368015872      6275923968      15
> > > > Slab: (kB)                    1139072         785408          31
> > > >
> > > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > > memory.usage_in_bytes         6540427264      6072893440      7
> > > > Slab: (kB)                    1074304         761280          29
> > > >
> > > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > > memory.usage_in_bytes         6406209536      6177357824      3
> > > > Slab: (kB)                    1244288         739712          40
> > > > --------------------------------------------------------------------------
> > > >
> > > > Slab consumption right after boot
> > > > --------------------------------------------------------------------------
> > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > --------------------------------------------------------------------------
> > > > Slab: (kB)                    821888          583424          29
> > > > ==========================================================================
> > > >
> > > > Summary:
> > > >
> > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > > around 70% and 30% reduction consistently.
> > > >
> > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > kernel compilation.
> > > >
> > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > same is seen right after boot too.
> > >
> > > That's just perfect!
> > >
> > > memory.usage_in_bytes was most likely the same because the freed space
> > > was taken by pagecache.
> > >
> > > Thank you very much for testing!
> > >
> > > Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-13  0:31         ` Pavel Tatashin
@ 2020-08-28 16:47           ` Pavel Tatashin
  2020-09-01  5:28             ` Bharata B Rao
  2020-09-02  9:53             ` Vlastimil Babka
  0 siblings, 2 replies; 56+ messages in thread
From: Pavel Tatashin @ 2020-08-28 16:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

There appears to be another problem that is related to the
cgroup_mutex -> mem_hotplug_lock deadlock described above.

In the original deadlock that I described, the workaround is to
replace crash dump from piping to Linux traditional save to files
method. However, after trying this workaround, I still observed
hardware watchdog resets during machine  shutdown.

The new problem occurs for the following reason: upon shutdown systemd
calls a service that hot-removes memory, and if hot-removing fails for
some reason systemd kills that service after timeout. However, systemd
is never able to kill the service, and we get hardware reset caused by
watchdog or a hang during shutdown:

Thread #1: memory hot-remove systemd service
Loops indefinitely, because if there is something still to be migrated
this loop never terminates. However, this loop can be terminated via
signal from systemd after timeout.
__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
                  # Returns 0, meaning there is nothing available to
                  # migrate, no page is PageLRU(page)
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
                  # Returns -EBUSY, meaning there is at least one PFN that
                  # still has to be migrated.
      } while (ret);

Thread #2: ccs killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: systemd
ksys_read
 vfs_read
   __vfs_read
     seq_read
       proc_single_show
         proc_cgroup_show
           mutex_lock -> wait for cgroup_mutex that is owned by Thread #2

Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
to thread #1.

The proper fix for both of the problems is to avoid cgroup_mutex ->
mem_hotplug_lock ordering that was recently fixed in the mainline but
still present in all stable branches. Unfortunately, I do not see a
simple fix in how to remove mem_hotplug_lock from
memcg_deactivate_kmem_caches without using Roman's series that is too
big for stable.

Thanks,
Pasha

On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > > Guys,
> > >
> > > There is a convoluted deadlock that I just root caused, and that is
> > > fixed by this work (at least based on my code inspection it appears to
> > > be fixed); but the deadlock exists in older and stable kernels, and I
> > > am not sure whether to create a separate patch for it, or backport
> > > this whole thing.
> >
>
> Hi Roman,
>
> > Hi Pavel,
> >
> > wow, it's a quite complicated deadlock. Thank you for providing
> > a perfect analysis!
>
> Thank you, it indeed took me a while to fully grasp the deadlock.
>
> >
> > Unfortunately, backporting the whole new slab controller isn't an option:
> > it's way too big and invasive.
>
> This is what I thought as well, this is why I want to figure out what
> is the best way forward.
>
> > Do you already have a standalone fix?
>
> Not yet, I do not have a standalone fix. I suspect the best fix would
> be to address fix css_killed_work_fn() stack so we never have:
> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
> the order would work. If you have suggestions since you worked on this
> code recently, please let me know.
>
> Thank you,
> Pasha
>
> >
> > Thanks!
> >
> >
> > >
> > > Thread #1: Hot-removes memory
> > > device_offline
> > >   memory_subsys_offline
> > >     offline_pages
> > >       __offline_pages
> > >         mem_hotplug_lock <- write access
> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > > migrate it.
> > >
> > > Thread #2: ccs killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: crashing userland program
> > > do_coredump
> > >   elf_core_dump
> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> > >       dump_emit
> > >         __kernel_write
> > >           __vfs_write
> > >             new_sync_write
> > >               pipe_write
> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> > > read the pipe
> > >
> > > Thread #4: systemd-coredump
> > > ksys_read
> > >   vfs_read
> > >     __vfs_read
> > >       seq_read
> > >         proc_single_show
> > >           proc_cgroup_show
> > >             cgroup_mutex -> waits from Thread #2 for this lock.
> >
> > >
> > > In Summary:
> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > > waits for Thread#1 for mem_hotplug_lock rwlock.
> > >
> > > This work appears to fix this deadlock because cgroup_mutex is not
> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > > removes memcg_deactivate_kmem_caches.
> > >
> > > Thank you,
> > > Pasha
> > >
> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > > replicating slab allocator internals for each memory cgroup.
> > > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > > >
> > > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > > a number of machines running different production workloads. In most
> > > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > > per page.
> > > > > >
> > > > > > The real reason why the existing design leads to a low slab utilization
> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > > in all these cases the resulting slab utilization is very low.
> > > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > > on slab pages for other allocations.
> > > > > >
> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > > create a large number of cgroups.
> > > > > >
> > > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > > design (more details in commit messages).
> > > > > >
> > > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > > with rounding up and remembering leftovers.
> > > > > >
> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > > allows to reparent all allocated objects without walking them over and
> > > > > > changing memcg pointer to the parent.
> > > > > >
> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > > allocations and the second set for all other allocations. This allows to
> > > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > > and make things generally simpler.
> > > > > >
> > > > > > The patchset* has been tested on a number of different workloads in our
> > > > > > production. In all cases it saved significant amount of memory, measured
> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > > of slab memory has been reduced by 35-45%.
> > > > >
> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > > with this patchset on a 10 core POWER8 host:
> > > > >
> > > > > ==========================================================================
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > > of a mem cgroup (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > > > memory.usage_in_bytes         337510400       335806464       .5
> > > > > Slab: (kB)                    814336          607296          25
> > > > >
> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > > > memory.usage_in_bytes         318832640       300154880       5
> > > > > Slab: (kB)                    789888          559744          29
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > >
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > > > memory.usage_in_bytes         7368015872      6275923968      15
> > > > > Slab: (kB)                    1139072         785408          31
> > > > >
> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > > > memory.usage_in_bytes         6540427264      6072893440      7
> > > > > Slab: (kB)                    1074304         761280          29
> > > > >
> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > > > memory.usage_in_bytes         6406209536      6177357824      3
> > > > > Slab: (kB)                    1244288         739712          40
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > > Slab consumption right after boot
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > Slab: (kB)                    821888          583424          29
> > > > > ==========================================================================
> > > > >
> > > > > Summary:
> > > > >
> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > > > around 70% and 30% reduction consistently.
> > > > >
> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > > kernel compilation.
> > > > >
> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > > same is seen right after boot too.
> > > >
> > > > That's just perfect!
> > > >
> > > > memory.usage_in_bytes was most likely the same because the freed space
> > > > was taken by pagecache.
> > > >
> > > > Thank you very much for testing!
> > > >
> > > > Roman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-28 16:47           ` Pavel Tatashin
@ 2020-09-01  5:28             ` Bharata B Rao
  2020-09-01 12:52               ` Pavel Tatashin
  2020-09-02  9:53             ` Vlastimil Babka
  1 sibling, 1 reply; 56+ messages in thread
From: Bharata B Rao @ 2020-09-01  5:28 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder if infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do? Earlier we had
a few other exit possibilities in this path (like max retries etc)
but those were removed by commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or, is the user-space test is expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-01  5:28             ` Bharata B Rao
@ 2020-09-01 12:52               ` Pavel Tatashin
  2020-09-02  6:23                 ` Bharata B Rao
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-01 12:52 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);
> >
> > Thread #2: ccs killer kthread
> >    css_killed_work_fn
> >      cgroup_mutex  <- Grab this Mutex
> >      mem_cgroup_css_offline
> >        memcg_offline_kmem.part
> >           memcg_deactivate_kmem_caches
> >             get_online_mems
> >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: systemd
> > ksys_read
> >  vfs_read
> >    __vfs_read
> >      seq_read
> >        proc_single_show
> >          proc_cgroup_show
> >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> >
> > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > to thread #1.
> >
> > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > still present in all stable branches. Unfortunately, I do not see a
> > simple fix in how to remove mem_hotplug_lock from
> > memcg_deactivate_kmem_caches without using Roman's series that is too
> > big for stable.
>
> We too are seeing this on Power systems when stress-testing memory
> hotplug, but with the following call trace (from hung task timer)
> instead of Thread #2 above:
>
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> get_online_mems
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
> ret_from_kernel_thread
>
> While I understand that Roman's new slab controller patchset will fix
> this, I also wonder if infinitely looping in the memory unplug path
> with mem_hotplug_lock held is the right thing to do? Earlier we had
> a few other exit possibilities in this path (like max retries etc)
> but those were removed by commits:
>
> 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
>
> Or, is the user-space test is expected to induce a signal back-off when
> unplug doesn't complete within a reasonable amount of time?

Hi Bharata,

Thank you for your input, it looks like you are experiencing the same
problems that I observed.

What I found is that the reason why our machines did not complete
hot-remove within the given time is because of this bug:
https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com

Could you please try it and see if that helps for your case?

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-01 12:52               ` Pavel Tatashin
@ 2020-09-02  6:23                 ` Bharata B Rao
  2020-09-02 12:34                   ` Pavel Tatashin
  0 siblings, 1 reply; 56+ messages in thread
From: Bharata B Rao @ 2020-09-02  6:23 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > replace crash dump from piping to Linux traditional save to files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine  shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason systemd kills that service after timeout. However, systemd
> > > is never able to kill the service, and we get hardware reset caused by
> > > watchdog or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
> > >
> > > Thread #2: ccs killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > >  vfs_read
> > >    __vfs_read
> > >      seq_read
> > >        proc_single_show
> > >          proc_cgroup_show
> > >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> > >
> > > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > > to thread #1.
> > >
> > > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple fix in how to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series that is too
> > > big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder if infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do? Earlier we had
> > a few other exit possibilities in this path (like max retries etc)
> > but those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test is expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
> 
> Hi Bharata,
> 
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
> 
> What I found is that the reason why our machines did not complete
> hot-remove within the given time is because of this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com
> 
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix that you are proposing,
so I might be seeing someother issue which I will debug further.

So looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-08-28 16:47           ` Pavel Tatashin
  2020-09-01  5:28             ` Bharata B Rao
@ 2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
                                 ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Vlastimil Babka @ 2020-09-02  9:53 UTC (permalink / raw)
  To: Pavel Tatashin, Roman Gushchin
  Cc: Bharata B Rao, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand, Michal Hocko

On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for

Why is that hotremove even needed if we're shutting down? Are there any
(virtualization?) platforms where it makes some difference over plain
shutdown/restart?

> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: ccs killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.
> 
> Thanks,
> Pasha
> 
> On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>> >
>> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
>> > > Guys,
>> > >
>> > > There is a convoluted deadlock that I just root caused, and that is
>> > > fixed by this work (at least based on my code inspection it appears to
>> > > be fixed); but the deadlock exists in older and stable kernels, and I
>> > > am not sure whether to create a separate patch for it, or backport
>> > > this whole thing.
>> >
>>
>> Hi Roman,
>>
>> > Hi Pavel,
>> >
>> > wow, it's a quite complicated deadlock. Thank you for providing
>> > a perfect analysis!
>>
>> Thank you, it indeed took me a while to fully grasp the deadlock.
>>
>> >
>> > Unfortunately, backporting the whole new slab controller isn't an option:
>> > it's way too big and invasive.
>>
>> This is what I thought as well, this is why I want to figure out what
>> is the best way forward.
>>
>> > Do you already have a standalone fix?
>>
>> Not yet, I do not have a standalone fix. I suspect the best fix would
>> be to address fix css_killed_work_fn() stack so we never have:
>> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
>> the order would work. If you have suggestions since you worked on this
>> code recently, please let me know.
>>
>> Thank you,
>> Pasha
>>
>> >
>> > Thanks!
>> >
>> >
>> > >
>> > > Thread #1: Hot-removes memory
>> > > device_offline
>> > >   memory_subsys_offline
>> > >     offline_pages
>> > >       __offline_pages
>> > >         mem_hotplug_lock <- write access
>> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
>> > > migrate it.
>> > >
>> > > Thread #2: ccs killer kthread
>> > >    css_killed_work_fn
>> > >      cgroup_mutex  <- Grab this Mutex
>> > >      mem_cgroup_css_offline
>> > >        memcg_offline_kmem.part
>> > >           memcg_deactivate_kmem_caches
>> > >             get_online_mems
>> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>> > >
>> > > Thread #3: crashing userland program
>> > > do_coredump
>> > >   elf_core_dump
>> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>> > >       dump_emit
>> > >         __kernel_write
>> > >           __vfs_write
>> > >             new_sync_write
>> > >               pipe_write
>> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
>> > > read the pipe
>> > >
>> > > Thread #4: systemd-coredump
>> > > ksys_read
>> > >   vfs_read
>> > >     __vfs_read
>> > >       seq_read
>> > >         proc_single_show
>> > >           proc_cgroup_show
>> > >             cgroup_mutex -> waits from Thread #2 for this lock.
>> >
>> > >
>> > > In Summary:
>> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
>> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
>> > > waits for Thread#1 for mem_hotplug_lock rwlock.
>> > >
>> > > This work appears to fix this deadlock because cgroup_mutex is not
>> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
>> > > removes memcg_deactivate_kmem_caches.
>> > >
>> > > Thank you,
>> > > Pasha
>> > >
>> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>> > > >
>> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
>> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
>> > > > > > The existing cgroup slab memory controller is based on the idea of
>> > > > > > replicating slab allocator internals for each memory cgroup.
>> > > > > > This approach promises a low memory overhead (one pointer per page),
>> > > > > > and isn't adding too much code on hot allocation and release paths.
>> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
>> > > > > >
>> > > > > > Using a drgn* script I've got an estimation of slab utilization on
>> > > > > > a number of machines running different production workloads. In most
>> > > > > > cases it was between 45% and 65%, and the best number I've seen was
>> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
>> > > > > > it brings back 30-50% of slab memory. It means that the real price
>> > > > > > of the existing slab memory controller is way bigger than a pointer
>> > > > > > per page.
>> > > > > >
>> > > > > > The real reason why the existing design leads to a low slab utilization
>> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
>> > > > > > If there are only few allocations of certain size made by a cgroup,
>> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
>> > > > > > deleted, or the cgroup contains a single-threaded application which is
>> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
>> > > > > > in all these cases the resulting slab utilization is very low.
>> > > > > > If kmem accounting is off, the kernel is able to use free space
>> > > > > > on slab pages for other allocations.
>> > > > > >
>> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
>> > > > > > introduced and was an opt-in feature, which had to be turned on
>> > > > > > individually for each memory cgroup. But now it's turned on by default
>> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
>> > > > > > create a large number of cgroups.
>> > > > > >
>> > > > > > This patchset provides a new implementation of the slab memory controller,
>> > > > > > which aims to reach a much better slab utilization by sharing slab pages
>> > > > > > between multiple memory cgroups. Below is the short description of the new
>> > > > > > design (more details in commit messages).
>> > > > > >
>> > > > > > Accounting is performed per-object instead of per-page. Slab-related
>> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
>> > > > > > with rounding up and remembering leftovers.
>> > > > > >
>> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
>> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
>> > > > > > working, instead of saving a pointer to the memory cgroup directly an
>> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
>> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
>> > > > > > allows to reparent all allocated objects without walking them over and
>> > > > > > changing memcg pointer to the parent.
>> > > > > >
>> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
>> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
>> > > > > > allocations and the second set for all other allocations. This allows to
>> > > > > > simplify the lifetime management of individual kmem_caches: they are
>> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
>> > > > > > and make things generally simpler.
>> > > > > >
>> > > > > > The patchset* has been tested on a number of different workloads in our
>> > > > > > production. In all cases it saved significant amount of memory, measured
>> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
>> > > > > > of slab memory has been reduced by 35-45%.
>> > > > >
>> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
>> > > > > with this patchset on a 10 core POWER8 host:
>> > > > >
>> > > > > ==========================================================================
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
>> > > > > of a mem cgroup (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
>> > > > > memory.usage_in_bytes         337510400       335806464       .5
>> > > > > Slab: (kB)                    814336          607296          25
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
>> > > > > memory.usage_in_bytes         318832640       300154880       5
>> > > > > Slab: (kB)                    789888          559744          29
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > >
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
>> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
>> > > > > memory.usage_in_bytes         7368015872      6275923968      15
>> > > > > Slab: (kB)                    1139072         785408          31
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
>> > > > > memory.usage_in_bytes         6540427264      6072893440      7
>> > > > > Slab: (kB)                    1074304         761280          29
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
>> > > > > memory.usage_in_bytes         6406209536      6177357824      3
>> > > > > Slab: (kB)                    1244288         739712          40
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > > Slab consumption right after boot
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > Slab: (kB)                    821888          583424          29
>> > > > > ==========================================================================
>> > > > >
>> > > > > Summary:
>> > > > >
>> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
>> > > > > around 70% and 30% reduction consistently.
>> > > > >
>> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
>> > > > > kernel compilation.
>> > > > >
>> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
>> > > > > same is seen right after boot too.
>> > > >
>> > > > That's just perfect!
>> > > >
>> > > > memory.usage_in_bytes was most likely the same because the freed space
>> > > > was taken by pagecache.
>> > > >
>> > > > Thank you very much for testing!
>> > > >
>> > > > Roman
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
@ 2020-09-02 10:39               ` David Hildenbrand
  2020-09-02 12:42                 ` Pavel Tatashin
  2020-09-02 11:26               ` Michal Hocko
  2020-09-02 11:32               ` Michal Hocko
  2 siblings, 1 reply; 56+ messages in thread
From: David Hildenbrand @ 2020-09-02 10:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Michal Hocko, Johannes Weiner, Shakeel Butt,
	Vladimir Davydov, linux-kernel, Kernel Team, Yafang Shao, stable,
	Linus Torvalds, Sasha Levin, Greg Kroah-Hartman,
	David Hildenbrand



> Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <vbabka@suse.cz>:
> 
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
>> There appears to be another problem that is related to the
>> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>> 
>> In the original deadlock that I described, the workaround is to
>> replace crash dump from piping to Linux traditional save to files
>> method. However, after trying this workaround, I still observed
>> hardware watchdog resets during machine  shutdown.
>> 
>> The new problem occurs for the following reason: upon shutdown systemd
>> calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
@ 2020-09-02 11:26               ` Michal Hocko
  2020-09-02 12:51                 ` Pavel Tatashin
  2020-09-02 11:32               ` Michal Hocko
  2 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-09-02 11:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > 
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> > 
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

Yes this sounds quite dubious.

> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> > 
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);

This shouldn't really happen. What does prevent from this to proceed?
Did you manage to catch the specific pfn and what is it used for?
start_isolate_page_range and scan_movable_pages should fail if there is
any memory that cannot be migrated permanently. This is something that
we should focus on when debugging.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  9:53             ` Vlastimil Babka
  2020-09-02 10:39               ` David Hildenbrand
  2020-09-02 11:26               ` Michal Hocko
@ 2020-09-02 11:32               ` Michal Hocko
  2020-09-02 12:53                 ` Pavel Tatashin
  2 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-09-02 11:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Tatashin, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> >> > > Thread #2: ccs killer kthread
> >> > >    css_killed_work_fn
> >> > >      cgroup_mutex  <- Grab this Mutex
> >> > >      mem_cgroup_css_offline
> >> > >        memcg_offline_kmem.part
> >> > >           memcg_deactivate_kmem_caches
> >> > >             get_online_mems
> >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access

And one more thing. THis has been brought up several times already.
Maybe I have forgoten but why do we take hotplug locks in this path in
the first place? Memory hotplug notifier takes slab_mutex so this
shouldn't be really needed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02  6:23                 ` Bharata B Rao
@ 2020-09-02 12:34                   ` Pavel Tatashin
  0 siblings, 0 replies; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:34 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel,
	Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

> I am on an old codebase that already has the fix that you are proposing,
> so I might be seeing someother issue which I will debug further.
>
> So looks like the loop in __offline_pages() had a call to
> drain_all_pages() before it was removed by
>
> c52e75935f8d: mm: remove extra drain pages on pcp list

I see, thanks. There is a reason to have the second drain, my fix is a
little better as it is performed only on a rare occasion when it is
needed, but I should add a FIXES tag. I have not checked
alloc_contig_range race.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 10:39               ` David Hildenbrand
@ 2020-09-02 12:42                 ` Pavel Tatashin
  2020-09-02 13:50                   ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Michal Hocko, Johannes Weiner, Shakeel Butt,
	Vladimir Davydov, linux-kernel, Kernel Team, Yafang Shao, stable,
	Linus Torvalds, Sasha Levin, Greg Kroah-Hartman,
	David Hildenbrand

> > Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <vbabka@suse.cz>:
> >
> > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> >> There appears to be another problem that is related to the
> >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >>
> >> In the original deadlock that I described, the workaround is to
> >> replace crash dump from piping to Linux traditional save to files
> >> method. However, after trying this workaround, I still observed
> >> hardware watchdog resets during machine  shutdown.
> >>
> >> The new problem occurs for the following reason: upon shutdown systemd
> >> calls a service that hot-removes memory, and if hot-removing fails for
> >
> > Why is that hotremove even needed if we're shutting down? Are there any
> > (virtualization?) platforms where it makes some difference over plain
> > shutdown/restart?
>
> If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)

Hi David,

This is how we are using it at Microsoft: there is  a very large
number of small memory machines (8G each) with low downtime
requirements (reboot must be under a second). There is also a large
state ~2G of memory that we need to transfer during reboot, otherwise
it is very expensive to recreate the state. We have 2G of system
memory memory reserved as a pmem in the device tree, and use it to
pass information across reboots. Once the information is not needed we
hot-add that memory and use it during runtime, before shutdown we
hot-remove the 2G, save the program state on it, and do the reboot.

Pasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 11:26               ` Michal Hocko
@ 2020-09-02 12:51                 ` Pavel Tatashin
  2020-09-02 13:51                   ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
>

Hi Micahl,

> This shouldn't really happen. What does prevent from this to proceed?
> Did you manage to catch the specific pfn and what is it used for?

I did.

> start_isolate_page_range and scan_movable_pages should fail if there is
> any memory that cannot be migrated permanently. This is something that
> we should focus on when debugging.

I was hitting this issue:
mm/memory_hotplug: drain per-cpu pages again during memory offline
https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

Once the pcp drain  race is fixed, this particular deadlock becomes irrelavent.

The lock ordering, however, cgroup_mutex ->  mem_hotplug_lock is bad,
and the first race condition that I was hitting and described above is
still present. For now I added a temporary workaround by using save to
file instead of piping the core during shutdown. I am glad the
mainline is fixed, but stables should also have some kind of fix for
this problem.

Pasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 11:32               ` Michal Hocko
@ 2020-09-02 12:53                 ` Pavel Tatashin
  2020-09-02 13:52                   ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-02 12:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > >> > > Thread #2: ccs killer kthread
> > >> > >    css_killed_work_fn
> > >> > >      cgroup_mutex  <- Grab this Mutex
> > >> > >      mem_cgroup_css_offline
> > >> > >        memcg_offline_kmem.part
> > >> > >           memcg_deactivate_kmem_caches
> > >> > >             get_online_mems
> > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>
> And one more thing. THis has been brought up several times already.
> Maybe I have forgoten but why do we take hotplug locks in this path in
> the first place? Memory hotplug notifier takes slab_mutex so this
> shouldn't be really needed.

Good point, it seems this lock can be completely removed from
memcg_deactivate_kmem_caches

Pasha

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:42                 ` Pavel Tatashin
@ 2020-09-02 13:50                   ` Michal Hocko
  2020-09-02 14:20                     ` Pavel Tatashin
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2020-09-02 13:50 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:42:13, Pavel Tatashin wrote:
> > > Am 02.09.2020 um 11:53 schrieb Vlastimil Babka <vbabka@suse.cz>:
> > >
> > > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > >> There appears to be another problem that is related to the
> > >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >>
> > >> In the original deadlock that I described, the workaround is to
> > >> replace crash dump from piping to Linux traditional save to files
> > >> method. However, after trying this workaround, I still observed
> > >> hardware watchdog resets during machine  shutdown.
> > >>
> > >> The new problem occurs for the following reason: upon shutdown systemd
> > >> calls a service that hot-removes memory, and if hot-removing fails for
> > >
> > > Why is that hotremove even needed if we're shutting down? Are there any
> > > (virtualization?) platforms where it makes some difference over plain
> > > shutdown/restart?
> >
> > If all it‘s doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it‘s doing and why? (Arch? Hypervisor?)
> 
> Hi David,
> 
> This is how we are using it at Microsoft: there is  a very large
> number of small memory machines (8G each) with low downtime
> requirements (reboot must be under a second). There is also a large
> state ~2G of memory that we need to transfer during reboot, otherwise
> it is very expensive to recreate the state. We have 2G of system
> memory memory reserved as a pmem in the device tree, and use it to
> pass information across reboots. Once the information is not needed we
> hot-add that memory and use it during runtime, before shutdown we
> hot-remove the 2G, save the program state on it, and do the reboot.

I still do not get it. So what does guarantee that the memory is
offlineable in the first place? Also what is the difference between
offlining and simply shutting the system down so that the memory is not
used in the first place. In other words what kind of difference
hotremove makes?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:51                 ` Pavel Tatashin
@ 2020-09-02 13:51                   ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2020-09-02 13:51 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:51:06, Pavel Tatashin wrote:
> > > > Thread #1: memory hot-remove systemd service
> > > > Loops indefinitely, because if there is something still to be migrated
> > > > this loop never terminates. However, this loop can be terminated via
> > > > signal from systemd after timeout.
> > > > __offline_pages()
> > > >       do {
> > > >           pfn = scan_movable_pages(pfn, end_pfn);
> > > >                   # Returns 0, meaning there is nothing available to
> > > >                   # migrate, no page is PageLRU(page)
> > > >           ...
> > > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > >                                             NULL, check_pages_isolated_cb);
> > > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > > >                   # still has to be migrated.
> > > >       } while (ret);
> >
> 
> Hi Micahl,
> 
> > This shouldn't really happen. What does prevent from this to proceed?
> > Did you manage to catch the specific pfn and what is it used for?
> 
> I did.
> 
> > start_isolate_page_range and scan_movable_pages should fail if there is
> > any memory that cannot be migrated permanently. This is something that
> > we should focus on when debugging.
> 
> I was hitting this issue:
> mm/memory_hotplug: drain per-cpu pages again during memory offline
> https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

I have noticed the patch but didn't have time to think it through (have
been few days off and catching up with emails). Will give it a higher
priority.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 12:53                 ` Pavel Tatashin
@ 2020-09-02 13:52                   ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2020-09-02 13:52 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao, linux-mm,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Vladimir Davydov,
	linux-kernel, Kernel Team, Yafang Shao, stable, Linus Torvalds,
	Sasha Levin, Greg Kroah-Hartman, David Hildenbrand

On Wed 02-09-20 08:53:49, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > > >> > > Thread #2: ccs killer kthread
> > > >> > >    css_killed_work_fn
> > > >> > >      cgroup_mutex  <- Grab this Mutex
> > > >> > >      mem_cgroup_css_offline
> > > >> > >        memcg_offline_kmem.part
> > > >> > >           memcg_deactivate_kmem_caches
> > > >> > >             get_online_mems
> > > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > And one more thing. THis has been brought up several times already.
> > Maybe I have forgoten but why do we take hotplug locks in this path in
> > the first place? Memory hotplug notifier takes slab_mutex so this
> > shouldn't be really needed.
> 
> Good point, it seems this lock can be completely removed from
> memcg_deactivate_kmem_caches

I am pretty sure we have discussed that in the past. But I do not
remember the outcome. Either we have concluded that this is indeed the
case but nobody came up with a patch or we have hit some obscure
issue... Maybe David/Roman rememeber more than I do.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 13:50                   ` Michal Hocko
@ 2020-09-02 14:20                     ` Pavel Tatashin
  2020-09-03 18:09                       ` David Hildenbrand
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Tatashin @ 2020-09-02 14:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman, David Hildenbrand

> > This is how we are using it at Microsoft: there is  a very large
> > number of small memory machines (8G each) with low downtime
> > requirements (reboot must be under a second). There is also a large
> > state ~2G of memory that we need to transfer during reboot, otherwise
> > it is very expensive to recreate the state. We have 2G of system
> > memory memory reserved as a pmem in the device tree, and use it to
> > pass information across reboots. Once the information is not needed we
> > hot-add that memory and use it during runtime, before shutdown we
> > hot-remove the 2G, save the program state on it, and do the reboot.
>
> I still do not get it. So what does guarantee that the memory is
> offlineable in the first place?

It is in a movable zone, and we have more than 2G of free memory for
successful migrations.

> Also what is the difference between
> offlining and simply shutting the system down so that the memory is not
> used in the first place. In other words what kind of difference
> hotremove makes?

For performance reasons during system updates/reboots we do not erase
memory content. The memory content is erased only on power cycle,
which we do not do in production.

Once we hot-remove the memory, we convert it back into DAXFS PMEM
device, format it into EXT4, mount it as DAX file system, and allow
programs to serialize their states to it so they can read it back
after the reboot.

During startup we mount pmem, programs read the state back, and after
that we hotplug the PMEM DAX as a movable zone. This way during normal
runtime we have 8G available to programs.

Pasha

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 00/28] The new cgroup slab memory controller
  2020-09-02 14:20                     ` Pavel Tatashin
@ 2020-09-03 18:09                       ` David Hildenbrand
  0 siblings, 0 replies; 56+ messages in thread
From: David Hildenbrand @ 2020-09-03 18:09 UTC (permalink / raw)
  To: Pavel Tatashin, Michal Hocko
  Cc: David Hildenbrand, Vlastimil Babka, Roman Gushchin,
	Bharata B Rao, linux-mm, Andrew Morton, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, linux-kernel, Kernel Team,
	Yafang Shao, stable, Linus Torvalds, Sasha Levin,
	Greg Kroah-Hartman

> For performance reasons during system updates/reboots we do not erase
> memory content. The memory content is erased only on power cycle,
> which we do not do in production.
> 
> Once we hot-remove the memory, we convert it back into DAXFS PMEM
> device, format it into EXT4, mount it as DAX file system, and allow
> programs to serialize their states to it so they can read it back
> after the reboot.
> 
> During startup we mount pmem, programs read the state back, and after
> that we hotplug the PMEM DAX as a movable zone. This way during normal
> runtime we have 8G available to programs.
> 

Thanks for sharing the workflow - while it sounds somewhat sub-optimal,
I guess it gets the job done using existing tools / mechanisms.

(I remember the persistent tmpfs over kexec RFC, which tries to tackle
it by introducing something new)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2020-09-03 18:09 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200127173453.2089565-1-guro@fb.com>
2020-01-27 17:34 ` [PATCH v2 14/28] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
2020-01-30  2:06 ` [PATCH v2 00/28] The new cgroup slab memory controller Bharata B Rao
2020-01-30  2:41   ` Roman Gushchin
2020-08-12 23:16     ` Pavel Tatashin
2020-08-12 23:18       ` Pavel Tatashin
2020-08-13  0:04       ` Roman Gushchin
2020-08-13  0:31         ` Pavel Tatashin
2020-08-28 16:47           ` Pavel Tatashin
2020-09-01  5:28             ` Bharata B Rao
2020-09-01 12:52               ` Pavel Tatashin
2020-09-02  6:23                 ` Bharata B Rao
2020-09-02 12:34                   ` Pavel Tatashin
2020-09-02  9:53             ` Vlastimil Babka
2020-09-02 10:39               ` David Hildenbrand
2020-09-02 12:42                 ` Pavel Tatashin
2020-09-02 13:50                   ` Michal Hocko
2020-09-02 14:20                     ` Pavel Tatashin
2020-09-03 18:09                       ` David Hildenbrand
2020-09-02 11:26               ` Michal Hocko
2020-09-02 12:51                 ` Pavel Tatashin
2020-09-02 13:51                   ` Michal Hocko
2020-09-02 11:32               ` Michal Hocko
2020-09-02 12:53                 ` Pavel Tatashin
2020-09-02 13:52                   ` Michal Hocko
     [not found] ` <20200127173453.2089565-28-guro@fb.com>
2020-01-30  2:17   ` [PATCH v2 27/28] tools/cgroup: make slabinfo.py compatible with new slab controller Bharata B Rao
2020-01-30  2:44     ` Roman Gushchin
2020-01-31 22:24     ` Roman Gushchin
2020-02-12  5:21       ` Bharata B Rao
2020-02-12 20:42         ` Roman Gushchin
     [not found] ` <20200127173453.2089565-8-guro@fb.com>
2020-02-03 16:05   ` [PATCH v2 07/28] mm: memcg/slab: introduce mem_cgroup_from_obj() Johannes Weiner
     [not found] ` <20200127173453.2089565-9-guro@fb.com>
2020-02-03 16:12   ` [PATCH v2 08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations Johannes Weiner
     [not found] ` <20200127173453.2089565-10-guro@fb.com>
2020-02-03 16:13   ` [PATCH v2 09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state() Johannes Weiner
     [not found] ` <20200127173453.2089565-11-guro@fb.com>
2020-02-03 17:39   ` [PATCH v2 10/28] mm: memcg: introduce mod_lruvec_memcg_state() Johannes Weiner
     [not found] ` <20200127173453.2089565-12-guro@fb.com>
2020-02-03 17:44   ` [PATCH v2 11/28] mm: slub: implement SLUB version of obj_to_index() Johannes Weiner
     [not found] ` <20200127173453.2089565-13-guro@fb.com>
2020-02-03 17:58   ` [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Johannes Weiner
2020-02-03 18:25     ` Roman Gushchin
2020-02-03 20:34       ` Johannes Weiner
2020-02-03 22:28         ` Roman Gushchin
2020-02-03 22:39           ` Johannes Weiner
2020-02-04  1:44             ` Roman Gushchin
     [not found] ` <20200127173453.2089565-17-guro@fb.com>
2020-02-03 18:27   ` [PATCH v2 16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Johannes Weiner
2020-02-03 18:34     ` Roman Gushchin
2020-02-03 20:46       ` Johannes Weiner
2020-02-03 21:19         ` Roman Gushchin
2020-02-03 22:29           ` Johannes Weiner
     [not found] ` <20200127173453.2089565-16-guro@fb.com>
2020-02-03 19:31   ` [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API Johannes Weiner
     [not found] ` <20200127173453.2089565-22-guro@fb.com>
2020-02-03 19:50   ` [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups Johannes Weiner
2020-02-03 20:58     ` Roman Gushchin
2020-02-03 22:17       ` Johannes Weiner
2020-02-03 22:38         ` Roman Gushchin
2020-02-04  1:15         ` Roman Gushchin
2020-02-04  2:47           ` Johannes Weiner
2020-02-04  4:35             ` Roman Gushchin
2020-02-04 18:41               ` Johannes Weiner
2020-02-05 15:58                 ` Roman Gushchin
     [not found] ` <20200127173453.2089565-18-guro@fb.com>
2020-02-03 19:53   ` [PATCH v2 17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).