Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
@ 2020-10-14 19:07 Richard Palethorpe
  2020-10-14 20:08 ` Roman Gushchin
  2020-10-16  9:47 ` Michal Koutný
  0 siblings, 2 replies; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-14 19:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Richard Palethorpe, ltp, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel

SLAB objects which outlive their memcg are moved to their parent
memcg where they may be uncharged. However if they are moved to the
root memcg, uncharging will result in negative page counter values as
root has no page counters.

To prevent this, we check whether we are about to uncharge the root
memcg and skip it if we are. Possibly instead; the obj_cgroups should
be removed from their slabs and any per cpu stocks instead of
reparenting them to root?

The warning can be, unreliably, reproduced with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
is present. Although the listed commit in 'fixes' appears to introduce
the bug, I can not reproduce it with just that commit and bisecting
runs into other bugs.

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

Reported-By: ltp@lists.linux.it
Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..214e1fe4e9a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	memcg = obj_cgroup_memcg(objcg);
-	if (nr_pages)
+	if (nr_pages && !mem_cgroup_is_root(memcg))
 		__memcg_kmem_uncharge(memcg, nr_pages);
 	list_del(&objcg->list);
 	mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
+	struct mem_cgroup *memcg;
 
 	if (!old)
 		return;
@@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 
 		if (nr_pages) {
 			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			memcg = obj_cgroup_memcg(old);
+			if (!mem_cgroup_is_root(memcg))
+				__memcg_kmem_uncharge(memcg, nr_pages);
 			rcu_read_unlock();
 		}
 
-- 
2.28.0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-14 19:07 [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root Richard Palethorpe
@ 2020-10-14 20:08 ` Roman Gushchin
  2020-10-16  5:40   ` Richard Palethorpe
  2020-10-16  9:47 ` Michal Koutný
  1 sibling, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2020-10-14 20:08 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: ltp, Johannes Weiner, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel

Hi Richard!

> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
> 
> To prevent this, we check whether we are about to uncharge the root
> memcg and skip it if we are. Possibly instead; the obj_cgroups should
> be removed from their slabs and any per cpu stocks instead of
> reparenting them to root?

It would be really complex. I think your fix is totally fine.
We have similar checks in cancel_charge(), uncharge_batch(),
mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.

> 
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
> 
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> ...
> Reported-By: ltp@lists.linux.it
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-14 20:08 ` Roman Gushchin
@ 2020-10-16  5:40   ` Richard Palethorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-16  5:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: ltp, Johannes Weiner, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel

Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> Hi Richard!
>
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
>> 
>> To prevent this, we check whether we are about to uncharge the root
>> memcg and skip it if we are. Possibly instead; the obj_cgroups should
>> be removed from their slabs and any per cpu stocks instead of
>> reparenting them to root?
>
> It would be really complex. I think your fix is totally fine.
> We have similar checks in cancel_charge(), uncharge_batch(),
> mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.
>
>
> Acked-by: Roman Gushchin <guro@fb.com>
>
> Thanks!

Great I will respin.

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-14 19:07 [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root Richard Palethorpe
  2020-10-14 20:08 ` Roman Gushchin
@ 2020-10-16  9:47 ` Michal Koutný
  2020-10-16 10:41   ` Richard Palethorpe
  2020-10-16 14:53   ` Johannes Weiner
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Koutný @ 2020-10-16  9:47 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Roman Gushchin, ltp, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 778 bytes --]

Hello.

On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up the
hierarchy, including root memcg, so if they're later uncharged in root
after reparenting, it should still break even. (Or did I miss some stock
imbalance?)

(But the patch seems justifiable to me as objects (not)charged directly to
root memcg may be incorrectly uncharged.)

Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16  9:47 ` Michal Koutný
@ 2020-10-16 10:41   ` Richard Palethorpe
  2020-10-16 15:05     ` Richard Palethorpe
  2020-10-16 14:53   ` Johannes Weiner
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-16 10:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, ltp, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel

Hello Michal,

Michal Koutný <mkoutny@suse.com> writes:

> Hello.
>
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

I traced it and can see they are reparented objects and that the root
groups counters are zero (or negative if I run madvise06 multiple times)
before a drain takes place. I'm guessing this is because the root group
has 'use_hierachy' set to false so that the childs page_counter parents
are set to NULL. However I will check, because I'm not sure about
either.

>
> (But the patch seems justifiable to me as objects (not)charged directly to
> root memcg may be incorrectly uncharged.)
>
> Thanks,
> Michal


-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16  9:47 ` Michal Koutný
  2020-10-16 10:41   ` Richard Palethorpe
@ 2020-10-16 14:53   ` Johannes Weiner
  2020-10-16 17:02     ` Roman Gushchin
  2020-10-16 17:15     ` Michal Koutný
  1 sibling, 2 replies; 45+ messages in thread
From: Johannes Weiner @ 2020-10-16 14:53 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Richard Palethorpe, Roman Gushchin, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel

On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> Hello.
> 
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> > SLAB objects which outlive their memcg are moved to their parent
> > memcg where they may be uncharged. However if they are moved to the
> > root memcg, uncharging will result in negative page counter values as
> > root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)

Looking a bit closer at this code, it's kind of a mess right now.

The central try_charge() function charges recursively all the way up
to and including the root. But not if it's called directly on the
root, in which case it bails and does nothing.

kmem and objcg use try_charge(), so they have the same
behavior. get_obj_cgroup_from_current() does it's own redundant
filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
does not, but its callsite __memcg_kmem_charge_page() does.

We should clean this up one way or another: either charge the root or
don't, but do it consistently.

Since we export memory.stat at the root now, we should probably just
always charge the root instead of special-casing it all over the place
and risking bugs.

Indeed, it looks like there is at least one bug where the root-level
memory.stat shows non-root slab objects, but not root ones, whereas it
shows all anon and cache pages, root or no root.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 10:41   ` Richard Palethorpe
@ 2020-10-16 15:05     ` Richard Palethorpe
  2020-10-16 17:26       ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-16 15:05 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, ltp, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel

Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello Michal,
>
> Michal Koutný <mkoutny@suse.com> writes:
>
>> Hello.
>>
>> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
>>> SLAB objects which outlive their memcg are moved to their parent
>>> memcg where they may be uncharged. However if they are moved to the
>>> root memcg, uncharging will result in negative page counter values as
>>> root has no page counters.
>> Why do you think those are reparented objects? If those are originally
>> charged in a non-root cgroup, then the charge value should be propagated up the
>> hierarchy, including root memcg, so if they're later uncharged in root
>> after reparenting, it should still break even. (Or did I miss some stock
>> imbalance?)
>
> I traced it and can see they are reparented objects and that the root
> groups counters are zero (or negative if I run madvise06 multiple times)
> before a drain takes place. I'm guessing this is because the root group
> has 'use_hierachy' set to false so that the childs page_counter parents
> are set to NULL. However I will check, because I'm not sure about
> either.

Yes, it appears that use_hierarchy=0 which is probably because the test
mounts cgroup v1, creates a child group within that and does not set
use_hierarchy on the root. On v2 root always has use_hierarchy enabled.

>
>>
>> (But the patch seems justifiable to me as objects (not)charged directly to
>> root memcg may be incorrectly uncharged.)
>>
>> Thanks,
>> Michal

I'm don't know if that could happen without reparenting. I suppose if
use_hierarchy=1 then actually this patch will result in root being
overcharged, so perhaps it should also check for use_hierarchy?

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 14:53   ` Johannes Weiner
@ 2020-10-16 17:02     ` Roman Gushchin
  2020-10-16 17:15     ` Michal Koutný
  1 sibling, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-16 17:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel

On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> > Hello.
> > 
> > On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <rpalethorpe@suse.com> wrote:
> > > SLAB objects which outlive their memcg are moved to their parent
> > > memcg where they may be uncharged. However if they are moved to the
> > > root memcg, uncharging will result in negative page counter values as
> > > root has no page counters.
> > Why do you think those are reparented objects? If those are originally
> > charged in a non-root cgroup, then the charge value should be propagated up the
> > hierarchy, including root memcg, so if they're later uncharged in root
> > after reparenting, it should still break even. (Or did I miss some stock
> > imbalance?)
> 
> Looking a bit closer at this code, it's kind of a mess right now.
> 
> The central try_charge() function charges recursively all the way up
> to and including the root. But not if it's called directly on the
> root, in which case it bails and does nothing.
> 
> kmem and objcg use try_charge(), so they have the same
> behavior. get_obj_cgroup_from_current() does it's own redundant
> filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
> does not, but its callsite __memcg_kmem_charge_page() does.
> 
> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.

+1

> 
> Since we export memory.stat at the root now, we should probably just
> always charge the root instead of special-casing it all over the place
> and risking bugs.

Hm, we export memory.stat but not memory.current. Charging the root memcg
seems to be an extra atomic operation, which can be avoided.

I wonder if we can handle it in page_counter.c, so there will be a single
place where we do the check.

> 
> Indeed, it looks like there is at least one bug where the root-level
> memory.stat shows non-root slab objects, but not root ones, whereas it
> shows all anon and cache pages, root or no root.

I'll take a look.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 14:53   ` Johannes Weiner
  2020-10-16 17:02     ` Roman Gushchin
@ 2020-10-16 17:15     ` Michal Koutný
  2020-10-19  8:45       ` Richard Palethorpe
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
  1 sibling, 2 replies; 45+ messages in thread
From: Michal Koutný @ 2020-10-16 17:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Richard Palethorpe, Roman Gushchin, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hiearchy=0 (which is the case here as Richard
wrote). The reparenting is hence somewhat incompatible with
new_parent.use_hiearchy=0 :-/

> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding root memcg from
charging is that users are spared from the charging overhead when memcg
tree is not created.  (Actually, I thought that was the reason for this
exception.)

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 15:05     ` Richard Palethorpe
@ 2020-10-16 17:26       ` Michal Koutný
  0 siblings, 0 replies; 45+ messages in thread
From: Michal Koutný @ 2020-10-16 17:26 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Roman Gushchin, ltp, Johannes Weiner, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel


[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe <rpalethorpe@suse.de> wrote:
> I'm don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was
originally (not)charged in the root memcg or it was only reparented to
it. (I originally considered only the genuine root objcgs.)

In this light, homogenous charing to root memcg looks really simpler but
I wonder what other surprises it brings about.

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 17:15     ` Michal Koutný
@ 2020-10-19  8:45       ` Richard Palethorpe
  2020-10-19  9:58         ` [PATCH v3] " Richard Palethorpe
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-19  8:45 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Johannes Weiner, Roman Gushchin, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel

Hello,

Michal Koutný <mkoutny@suse.com> writes:

> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> The central try_charge() function charges recursively all the way up
>> to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
>

Yes and it also seems

new_parent.use_hierarch=0 -> new_child.use_hierarchy=0

and

new_parent.use_hierarch=0 -> new_child.use_hierarchy=1

are considered valid on cgroupsV1. The kernel will also allow more
descendants on new_child.use_hierarchy=0, but sets
broken_hierarchy=1. However this will not stop the stack trace occuring
(AFAICT) when the reparenting happens between two descendants.

>> We should clean this up one way or another: either charge the root or
>> don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)
>
> Michal


-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19  8:45       ` Richard Palethorpe
@ 2020-10-19  9:58         ` Richard Palethorpe
  2020-10-19 16:58           ` Shakeel Butt
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-19  9:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Richard Palethorpe, ltp, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Shakeel Butt, Christoph Lameter, Michal Hocko,
	Tejun Heo, Vlastimil Babka

SLAB objects which outlive their descendant memcg are moved to their
parent memcg where they may be uncharged. However if they are moved to
the root memcg and use_hierarchy=0, uncharging will result in negative
page counter values. This is because when use_hierarchy=0, the root
memcg's page counters are disconnected from its children.

To prevent this, we check whether we are about to uncharge the root
memcg and whether use_hierarchy=0. If this is the case then we skip
uncharging.

Note that on the default hierarchy (CGroupV2 now) root always has
use_hierarchy=1. So this only effects CGroupV1. Also it is possible to
have a deeper hierarchy where descendants also have use_hierarchy=0;
this is not considered valid by the kernel, but it is still allowed
and in such cases reparenting may still result in negative page
counter values.

The warning can be, unreliably, reproduced with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
is present. Although the listed commit in 'fixes' appears to introduce
the bug, I can not reproduce it with just that commit and bisecting
runs into other bugs.

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

Reported-by: ltp@lists.linux.it
Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---

V3: Handle common case where use_hierarchy=1 and update description.

 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..34b8c4a66853 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	memcg = obj_cgroup_memcg(objcg);
-	if (nr_pages)
+	if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
 		__memcg_kmem_uncharge(memcg, nr_pages);
 	list_del(&objcg->list);
 	mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
+	struct mem_cgroup *memcg;
 
 	if (!old)
 		return;
@@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 
 		if (nr_pages) {
 			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			memcg = obj_cgroup_memcg(old);
+			if (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy)
+				__memcg_kmem_uncharge(memcg, nr_pages);
 			rcu_read_unlock();
 		}
 
-- 
2.28.0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19  9:58         ` [PATCH v3] " Richard Palethorpe
@ 2020-10-19 16:58           ` Shakeel Butt
  2020-10-20  5:52             ` Richard Palethorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2020-10-19 16:58 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Linux MM, LKML, LTP List, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Mon, Oct 19, 2020 at 2:59 AM Richard Palethorpe <rpalethorpe@suse.com> wrote:
>
> SLAB objects which outlive their descendant memcg are moved to their
> parent memcg where they may be uncharged. However if they are moved to
> the root memcg and use_hierarchy=0, uncharging will result in negative
> page counter values. This is because when use_hierarchy=0, the root
> memcg's page counters are disconnected from its children.
>
> To prevent this, we check whether we are about to uncharge the root
> memcg and whether use_hierarchy=0. If this is the case then we skip
> uncharging.
>
> Note that on the default hierarchy (CGroupV2 now) root always has
> use_hierarchy=1. So this only effects CGroupV1. Also it is possible to
> have a deeper hierarchy where descendants also have use_hierarchy=0;
> this is not considered valid by the kernel, but it is still allowed
> and in such cases reparenting may still result in negative page
> counter values.
>
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
>
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> [   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
> [   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
> [   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
> [   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
> [   12.033059] kthread (kernel/kthread.c:292)
> [   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
> [   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [   12.033357] ---[ end trace 961dbfc01c109d1f ]---
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> Reported-by: ltp@lists.linux.it
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> ---
>
> V3: Handle common case where use_hierarchy=1 and update description.
>
>  mm/memcontrol.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6877c765b8d0..34b8c4a66853 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>
>         spin_lock_irqsave(&css_set_lock, flags);
>         memcg = obj_cgroup_memcg(objcg);
> -       if (nr_pages)
> +       if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))

If we have non-root memcg with use_hierarchy as 0 and this objcg was
reparented then this __memcg_kmem_uncharge() can potentially underflow
the page counter and give the same warning.

We never set root_mem_cgroup->objcg, so, no need to check for root
here. I think checking just memcg->use_hierarchy should be sufficient.

>                 __memcg_kmem_uncharge(memcg, nr_pages);
>         list_del(&objcg->list);
>         mem_cgroup_put(memcg);
> @@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  {
>         struct obj_cgroup *old = stock->cached_objcg;
> +       struct mem_cgroup *memcg;
>
>         if (!old)
>                 return;
> @@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>
>                 if (nr_pages) {
>                         rcu_read_lock();
> -                       __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> +                       memcg = obj_cgroup_memcg(old);
> +                       if (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy)
> +                               __memcg_kmem_uncharge(memcg, nr_pages);
>                         rcu_read_unlock();
>                 }
>
> --
> 2.28.0
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-16 17:15     ` Michal Koutný
  2020-10-19  8:45       ` Richard Palethorpe
@ 2020-10-19 22:28       ` Roman Gushchin
  2020-10-20  6:04         ` Richard Palethorpe
                           ` (3 more replies)
  1 sibling, 4 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-19 22:28 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Johannes Weiner, Richard Palethorpe, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel, Michal Hocko

On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > The central try_charge() function charges recursively all the way up
> > to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
> 
> > We should clean this up one way or another: either charge the root or
> > don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created.  (Actually, I thought that was the reason for this
> exception.)

Yeah, I'm completely on the same page. Moving a process to the root memory
cgroup is currently a good way to estimate the memory cgroup overhead.

How about the patch below, which consistently avoids charging the root
memory cgroup? It seems like it doesn't add too many checks.

Thanks!

--

From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Mon, 19 Oct 2020 14:37:35 -0700
Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup

Currently the root memory cgroup is never charged directly, but
if an ancestor cgroup is charged, the charge is propagated up to the
root memory cgroup. The root memory cgroup doesn't show the charge
to a user, neither it does allow to set any limits/protections.
So the information about the current charge is completely useless.

Avoiding to charge the root memory cgroup allows to:
1) simplify the model and the code, so, hopefully, fewer bugs will
   be introduced in the future;
2) avoid unnecessary atomic operations, which are used to (un)charge
   corresponding root page counters.

In the default hierarchy case or if use_hiearchy == true, it's very
straightforward: when the page counters tree is traversed to the root,
the root page counter (the one with parent == NULL), should be
skipped. To avoid multiple identical checks over the page counters
code, for_each_nonroot_ancestor() macro is introduced.

To handle the use_hierarchy == false case without adding custom
checks, let's make page counters of all non-root memory cgroup
direct ascendants of the corresponding root memory cgroup's page
counters. In this case for_each_nonroot_ancestor() will work correctly
as well.

Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
However, it's not based on page counters (refer to mem_cgroup_usage()).

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c   | 21 ++++++++++++++++-----
 mm/page_counter.c | 21 ++++++++++++---------
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..34cac7522e74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		/* root memory cgroup */
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->tcpmem, NULL);
+	} else if (parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
-		page_counter_init(&memcg->memory, NULL);
-		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->kmem, NULL);
-		page_counter_init(&memcg->tcpmem, NULL);
+		/*
+		 * If use_hierarchy == false, consider all page counters direct
+		 * descendants of the corresponding root level counters.
+		 */
+		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
+
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b24a60b28bb0..8901b184b9d5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,9 @@
 #include <linux/bug.h>
 #include <asm/page.h>
 
+#define for_each_nonroot_ancestor(c, counter) \
+	for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
+
 static void propagate_protected_usage(struct page_counter *c,
 				      unsigned long usage)
 {
@@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
 	unsigned long low, min;
 	long delta;
 
-	if (!c->parent)
-		return;
-
 	min = READ_ONCE(c->min);
 	if (min || atomic_long_read(&c->min_usage)) {
 		protected = min(usage, min);
@@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent) {
+	for_each_nonroot_ancestor(c, counter) {
 		long new;
 
 		new = atomic_long_add_return(nr_pages, &c->usage);
@@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent) {
+	for_each_nonroot_ancestor(c, counter) {
 		long new;
 		/*
 		 * Charge speculatively to avoid an expensive CAS.  If
@@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
 	return true;
 
 failed:
-	for (c = counter; c != *fail; c = c->parent)
+	for_each_nonroot_ancestor(c, counter) {
+		if (c == *fail)
+			break;
 		page_counter_cancel(c, nr_pages);
+	}
 
 	return false;
 }
@@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 {
 	struct page_counter *c;
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		page_counter_cancel(c, nr_pages);
 }
 
@@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
 
 	WRITE_ONCE(counter->min, nr_pages);
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		propagate_protected_usage(c, atomic_long_read(&c->usage));
 }
 
@@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
 
 	WRITE_ONCE(counter->low, nr_pages);
 
-	for (c = counter; c; c = c->parent)
+	for_each_nonroot_ancestor(c, counter)
 		propagate_protected_usage(c, atomic_long_read(&c->usage));
 }
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19 16:58           ` Shakeel Butt
@ 2020-10-20  5:52             ` Richard Palethorpe
  2020-10-20 13:49               ` Richard Palethorpe
  2020-10-20 17:24               ` Michal Koutný
  0 siblings, 2 replies; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-20  5:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Linux MM, LKML, LTP List, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

Hello Shakeel,

Shakeel Butt <shakeelb@google.com> writes:
>>
>> V3: Handle common case where use_hierarchy=1 and update description.
>>
>>  mm/memcontrol.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6877c765b8d0..34b8c4a66853 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>>
>>         spin_lock_irqsave(&css_set_lock, flags);
>>         memcg = obj_cgroup_memcg(objcg);
>> -       if (nr_pages)
>> +       if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
>
> If we have non-root memcg with use_hierarchy as 0 and this objcg was
> reparented then this __memcg_kmem_uncharge() can potentially underflow
> the page counter and give the same warning.

Yes, although the kernel considers such a config to be broken, and
prints a warning to the log, it does allow it.

>
> We never set root_mem_cgroup->objcg, so, no need to check for root

I don't think that is relevant as we get the memcg from objcg->memcg
which is set during reparenting. I suppose however, we can determine if
the objcg was reparented by inspecting memcg->objcg.

> here. I think checking just memcg->use_hierarchy should be sufficient.

If we just check use_hierarchy then objects directly charged to the
memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
better to check if it was reparented and if use_hierarchy=0.

>
>>                 __memcg_kmem_uncharge(memcg, nr_pages);
>>         list_del(&objcg->list);
>>         mem_cgroup_put(memcg);
>> @@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>>  static void drain_obj_stock(struct memcg_stock_pcp *stock)
>>  {
>>         struct obj_cgroup *old = stock->cached_objcg;
>> +       struct mem_cgroup *memcg;
>>
>>         if (!old)
>>                 return;
>> @@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>>
>>                 if (nr_pages) {
>>                         rcu_read_lock();
>> -                       __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
>> +                       memcg = obj_cgroup_memcg(old);
>> +                       if (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy)
>> +                               __memcg_kmem_uncharge(memcg, nr_pages);
>>                         rcu_read_unlock();
>>                 }
>>
>> --
>> 2.28.0
>>


-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
@ 2020-10-20  6:04         ` Richard Palethorpe
  2020-10-20 12:02           ` Richard Palethorpe
  2020-10-20 14:48         ` Richard Palethorpe
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-20  6:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Johannes Weiner, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > The central try_charge() function charges recursively all the way up
>> > to and including the root.
>> Except for use_hiearchy=0 (which is the case here as Richard
>> wrote). The reparenting is hence somewhat incompatible with
>> new_parent.use_hiearchy=0 :-/
>> 
>> > We should clean this up one way or another: either charge the root or
>> > don't, but do it consistently.
>> I agree this'd be good to unify. One upside of excluding root memcg from
>> charging is that users are spared from the charging overhead when memcg
>> tree is not created.  (Actually, I thought that was the reason for this
>> exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c   | 21 ++++++++++++++++-----
>  mm/page_counter.c | 21 ++++++++++++---------
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
>  	}
> -	if (parent && parent->use_hierarchy) {
> +	if (!parent) {
> +		/* root memory cgroup */
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->swap, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->tcpmem, NULL);
> +	} else if (parent->use_hierarchy) {
>  		memcg->use_hierarchy = true;
>  		page_counter_init(&memcg->memory, &parent->memory);
>  		page_counter_init(&memcg->swap, &parent->swap);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
>  		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>  	} else {
> -		page_counter_init(&memcg->memory, NULL);
> -		page_counter_init(&memcg->swap, NULL);
> -		page_counter_init(&memcg->kmem, NULL);
> -		page_counter_init(&memcg->tcpmem, NULL);
> +		/*
> +		 * If use_hierarchy == false, consider all page counters direct
> +		 * descendants of the corresponding root level counters.
> +		 */
> +		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> +		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> +		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> +		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this

Perhaps in this case, where the hierarchy is broken, objcgs should also
be reparented directly to root? Otherwise it will still be possible to
underflow the counter in a descendant of root which has use_hierarchy=0,
but also has children.

> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..8901b184b9d5 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -13,6 +13,9 @@
>  #include <linux/bug.h>
>  #include <asm/page.h>
>  
> +#define for_each_nonroot_ancestor(c, counter) \
> +	for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
> +
>  static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
> @@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
>  	unsigned long low, min;
>  	long delta;
>  
> -	if (!c->parent)
> -		return;
> -
>  	min = READ_ONCE(c->min);
>  	if (min || atomic_long_read(&c->min_usage)) {
>  		protected = min(usage, min);
> @@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent) {
> +	for_each_nonroot_ancestor(c, counter) {
>  		long new;
>  
>  		new = atomic_long_add_return(nr_pages, &c->usage);
> @@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent) {
> +	for_each_nonroot_ancestor(c, counter) {
>  		long new;
>  		/*
>  		 * Charge speculatively to avoid an expensive CAS.  If
> @@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
>  	return true;
>  
>  failed:
> -	for (c = counter; c != *fail; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter) {
> +		if (c == *fail)
> +			break;
>  		page_counter_cancel(c, nr_pages);
> +	}
>  
>  	return false;
>  }
> @@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		page_counter_cancel(c, nr_pages);
>  }
>  
> @@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>  
>  	WRITE_ONCE(counter->min, nr_pages);
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		propagate_protected_usage(c, atomic_long_read(&c->usage));
>  }
>  
> @@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
>  
>  	WRITE_ONCE(counter->low, nr_pages);
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		propagate_protected_usage(c, atomic_long_read(&c->usage));
>  }


-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20  6:04         ` Richard Palethorpe
@ 2020-10-20 12:02           ` Richard Palethorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-20 12:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Johannes Weiner, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello Roman,
>
> Roman Gushchin <guro@fb.com> writes:
>
>> -		page_counter_init(&memcg->memory, NULL);
>> -		page_counter_init(&memcg->swap, NULL);
>> -		page_counter_init(&memcg->kmem, NULL);
>> -		page_counter_init(&memcg->tcpmem, NULL);
>> +		/*
>> +		 * If use_hierarchy == false, consider all page counters direct
>> +		 * descendants of the corresponding root level counters.
>> +		 */
>> +		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
>> +		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
>> +		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
>> +		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
>> +
>>  		/*
>>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>>  		 * much sense so let cgroup subsystem know about this
>
> Perhaps in this case, where the hierarchy is broken, objcgs should also
> be reparented directly to root? Otherwise it will still be possible to
> underflow the counter in a descendant of root which has use_hierarchy=0,
> but also has children.

Sorry ignore me, parent_mem_cgroup already selects root. So in the case
of a broken hierarchy objcgs are reparented directly to root.

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20  5:52             ` Richard Palethorpe
@ 2020-10-20 13:49               ` Richard Palethorpe
  2020-10-20 16:56                 ` Shakeel Butt
  2020-10-20 17:24               ` Michal Koutný
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-20 13:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Linux MM, LKML, LTP List, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello Shakeel,
>
> Shakeel Butt <shakeelb@google.com> writes:
>>>
>>> V3: Handle common case where use_hierarchy=1 and update description.
>>>
>>>  mm/memcontrol.c | 7 +++++--
>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 6877c765b8d0..34b8c4a66853 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>>>
>>>         spin_lock_irqsave(&css_set_lock, flags);
>>>         memcg = obj_cgroup_memcg(objcg);
>>> -       if (nr_pages)
>>> +       if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
>>
>> If we have non-root memcg with use_hierarchy as 0 and this objcg was
>> reparented then this __memcg_kmem_uncharge() can potentially underflow
>> the page counter and give the same warning.
>
> Yes, although the kernel considers such a config to be broken, and
> prints a warning to the log, it does allow it.

Actually this can not happen because if use_hierarchy=0 then the objcg
will be reparented to root.

>
>>
>> We never set root_mem_cgroup->objcg, so, no need to check for root
>
> I don't think that is relevant as we get the memcg from objcg->memcg
> which is set during reparenting. I suppose however, we can determine if
> the objcg was reparented by inspecting memcg->objcg.
>
>> here. I think checking just memcg->use_hierarchy should be sufficient.
>
> If we just check use_hierarchy then objects directly charged to the
> memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
> better to check if it was reparented and if use_hierarchy=0.

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
  2020-10-20  6:04         ` Richard Palethorpe
@ 2020-10-20 14:48         ` Richard Palethorpe
  2020-10-20 16:27         ` Michal Koutný
  2020-10-20 16:55         ` Shakeel Butt
  3 siblings, 0 replies; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-20 14:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Johannes Weiner, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c   | 21 ++++++++++++++++-----
>  mm/page_counter.c | 21 ++++++++++++---------
>  2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
>  	}
> -	if (parent && parent->use_hierarchy) {
> +	if (!parent) {
> +		/* root memory cgroup */
> +		page_counter_init(&memcg->memory, NULL);
> +		page_counter_init(&memcg->swap, NULL);
> +		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->tcpmem, NULL);
> +	} else if (parent->use_hierarchy) {
>  		memcg->use_hierarchy = true;
>  		page_counter_init(&memcg->memory, &parent->memory);
>  		page_counter_init(&memcg->swap, &parent->swap);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
>  		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>  	} else {
> -		page_counter_init(&memcg->memory, NULL);
> -		page_counter_init(&memcg->swap, NULL);
> -		page_counter_init(&memcg->kmem, NULL);
> -		page_counter_init(&memcg->tcpmem, NULL);
> +		/*
> +		 * If use_hierarchy == false, consider all page counters direct
> +		 * descendants of the corresponding root level counters.
> +		 */
> +		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> +		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> +		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> +		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..8901b184b9d5 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -13,6 +13,9 @@
>  #include <linux/bug.h>
>  #include <asm/page.h>
>  
> +#define for_each_nonroot_ancestor(c, counter) \
> +	for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
> +
>  static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
> @@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
>  	unsigned long low, min;
>  	long delta;
>  
> -	if (!c->parent)
> -		return;
> -
>  	min = READ_ONCE(c->min);
>  	if (min || atomic_long_read(&c->min_usage)) {
>  		protected = min(usage, min);
> @@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent) {
> +	for_each_nonroot_ancestor(c, counter) {
>  		long new;
>  
>  		new = atomic_long_add_return(nr_pages, &c->usage);
> @@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent) {
> +	for_each_nonroot_ancestor(c, counter) {
>  		long new;
>  		/*
>  		 * Charge speculatively to avoid an expensive CAS.  If
> @@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
>  	return true;
>  
>  failed:
> -	for (c = counter; c != *fail; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter) {
> +		if (c == *fail)
> +			break;
>  		page_counter_cancel(c, nr_pages);
> +	}
>  
>  	return false;
>  }
> @@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
>  {
>  	struct page_counter *c;
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		page_counter_cancel(c, nr_pages);
>  }
>  
> @@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>  
>  	WRITE_ONCE(counter->min, nr_pages);
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		propagate_protected_usage(c, atomic_long_read(&c->usage));
>  }
>  
> @@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
>  
>  	WRITE_ONCE(counter->low, nr_pages);
>  
> -	for (c = counter; c; c = c->parent)
> +	for_each_nonroot_ancestor(c, counter)
>  		propagate_protected_usage(c, atomic_long_read(&c->usage));
>  }

This for sure prevents the counter underflow reported by madvise06 and
makes my patch redundant. Although perhaps this is significantly more
intrusive, so not suitable for the 5.9 stable branch?

Tested-by: Richard Palethorpe <rpalethorpe@suse.com>

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
  2020-10-20  6:04         ` Richard Palethorpe
  2020-10-20 14:48         ` Richard Palethorpe
@ 2020-10-20 16:27         ` Michal Koutný
  2020-10-20 17:07           ` Roman Gushchin
  2020-10-20 16:55         ` Shakeel Butt
  3 siblings, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2020-10-20 16:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Richard Palethorpe, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel, Michal Hocko


[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

Hi.

On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <guro@fb.com> wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/

> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...

> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.


> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root
counters and keep the traversal as is.

> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
                           ` (2 preceding siblings ...)
  2020-10-20 16:27         ` Michal Koutný
@ 2020-10-20 16:55         ` Shakeel Butt
  2020-10-20 17:17           ` Roman Gushchin
  3 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2020-10-20 16:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Johannes Weiner, Richard Palethorpe, LTP List, Andrew Morton,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	Linux MM, LKML, Michal Hocko

On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > The central try_charge() function charges recursively all the way up
> > > to and including the root.
> > Except for use_hiearchy=0 (which is the case here as Richard
> > wrote). The reparenting is hence somewhat incompatible with
> > new_parent.use_hiearchy=0 :-/
> >
> > > We should clean this up one way or another: either charge the root or
> > > don't, but do it consistently.
> > I agree this'd be good to unify. One upside of excluding root memcg from
> > charging is that users are spared from the charging overhead when memcg
> > tree is not created.  (Actually, I thought that was the reason for this
> > exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
>    be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
>    corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

This patch is only doing the page counter part of the cleanup (i.e. to
not update root's page counters when descendent's page counter is
updated) but not the stats part.

For the user memory, we do update the stats for the root memcg and do
set page->mem_cgroup = root_mem_cgroup. However this is not done for
the kmem/obj. I thought this is what Johannes was asking for the
cleanup.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 13:49               ` Richard Palethorpe
@ 2020-10-20 16:56                 ` Shakeel Butt
  2020-10-21 20:32                   ` Roman Gushchin
  0 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2020-10-20 16:56 UTC (permalink / raw)
  To: rpalethorpe
  Cc: Linux MM, LKML, LTP List, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Tue, Oct 20, 2020 at 6:49 AM Richard Palethorpe <rpalethorpe@suse.de> wrote:
>
> Hello,
>
> Richard Palethorpe <rpalethorpe@suse.de> writes:
>
> > Hello Shakeel,
> >
> > Shakeel Butt <shakeelb@google.com> writes:
> >>>
> >>> V3: Handle common case where use_hierarchy=1 and update description.
> >>>
> >>>  mm/memcontrol.c | 7 +++++--
> >>>  1 file changed, 5 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 6877c765b8d0..34b8c4a66853 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> >>>
> >>>         spin_lock_irqsave(&css_set_lock, flags);
> >>>         memcg = obj_cgroup_memcg(objcg);
> >>> -       if (nr_pages)
> >>> +       if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
> >>
> >> If we have non-root memcg with use_hierarchy as 0 and this objcg was
> >> reparented then this __memcg_kmem_uncharge() can potentially underflow
> >> the page counter and give the same warning.
> >
> > Yes, although the kernel considers such a config to be broken, and
> > prints a warning to the log, it does allow it.
>
> Actually this can not happen because if use_hierarchy=0 then the objcg
> will be reparented to root.
>

Yup, you are right. I do wonder if we should completely deprecate
use_hierarchy=0.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 16:27         ` Michal Koutný
@ 2020-10-20 17:07           ` Roman Gushchin
  2020-10-20 18:18             ` Johannes Weiner
  0 siblings, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2020-10-20 17:07 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Johannes Weiner, Richard Palethorpe, ltp, Andrew Morton,
	Shakeel Butt, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-kernel, Michal Hocko

On Tue, Oct 20, 2020 at 06:27:14PM +0200, Michal Koutny wrote:
> Hi.
> 
> On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> s/ancestor/descendant/

Oops, will fix, thanks!

> 
> > The root memory cgroup doesn't show the charge to a user, neither it
> > does allow to set any limits/protections.
> An appealing claim, I'd like this to be true...
> 
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> ...and it almost is. But there are still exposed kmem and tcpmem counters.

Hm, I wonder what do they show given that we never set sk->sk_memcg
to the root_mem_cgroup (see mem_cgroup_sk_alloc()) and we never charge
the root_mem_cgroup for !slab kmem allocations (see __memcg_kmem_charge_page()).

So yeah, it's quite a mess now, and it looks like it has been broken
in multiple places and for a while.

If we want these counter to function properly, then we should go into the opposite
direction and remove the special handling of the root memory cgroup in many places.

> > To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> If the assumptions behind this patch's idea were true, I think the
> implementation would be simpler by merely (not)connecting the root
> counters and keep the traversal as is.

We use some fields in root page counters to calculate protections:
see propagate_protected_usage().

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 16:55         ` Shakeel Butt
@ 2020-10-20 17:17           ` Roman Gushchin
  0 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-20 17:17 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Johannes Weiner, Richard Palethorpe, LTP List, Andrew Morton,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	Linux MM, LKML, Michal Hocko

On Tue, Oct 20, 2020 at 09:55:38AM -0700, Shakeel Butt wrote:
> On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > The central try_charge() function charges recursively all the way up
> > > > to and including the root.
> > > Except for use_hiearchy=0 (which is the case here as Richard
> > > wrote). The reparenting is hence somewhat incompatible with
> > > new_parent.use_hiearchy=0 :-/
> > >
> > > > We should clean this up one way or another: either charge the root or
> > > > don't, but do it consistently.
> > > I agree this'd be good to unify. One upside of excluding root memcg from
> > > charging is that users are spared from the charging overhead when memcg
> > > tree is not created.  (Actually, I thought that was the reason for this
> > > exception.)
> >
> > Yeah, I'm completely on the same page. Moving a process to the root memory
> > cgroup is currently a good way to estimate the memory cgroup overhead.
> >
> > How about the patch below, which consistently avoids charging the root
> > memory cgroup? It seems like it doesn't add too many checks.
> >
> > Thanks!
> >
> > --
> >
> > From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Mon, 19 Oct 2020 14:37:35 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
> >
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> > root memory cgroup. The root memory cgroup doesn't show the charge
> > to a user, neither it does allow to set any limits/protections.
> > So the information about the current charge is completely useless.
> >
> > Avoiding to charge the root memory cgroup allows to:
> > 1) simplify the model and the code, so, hopefully, fewer bugs will
> >    be introduced in the future;
> > 2) avoid unnecessary atomic operations, which are used to (un)charge
> >    corresponding root page counters.
> >
> > In the default hierarchy case or if use_hiearchy == true, it's very
> > straightforward: when the page counters tree is traversed to the root,
> > the root page counter (the one with parent == NULL), should be
> > skipped. To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> >
> > To handle the use_hierarchy == false case without adding custom
> > checks, let's make page counters of all non-root memory cgroup
> > direct ascendants of the corresponding root memory cgroup's page
> > counters. In this case for_each_nonroot_ancestor() will work correctly
> > as well.
> >
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> This patch is only doing the page counter part of the cleanup (i.e. to
> not update root's page counters when descendent's page counter is
> updated) but not the stats part.
> 
> For the user memory, we do update the stats for the root memcg and do
> set page->mem_cgroup = root_mem_cgroup. However this is not done for
> the kmem/obj. I thought this is what Johannes was asking for the
> cleanup.

Yes, it's not the whole story, of course.

Actually I missed that we do export root kmem and tcpmem counters
on cgroup v1 (thanks Michal to pointing at it!). If we want them to function
properly, we have to go into the opposite direction and start charging
the root cgroup for all kinds of kernel memory allocations.

We also have the same problem with root MEMCG_SOCK stats, which seems
to be broken now.

I'll master a patch.

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20  5:52             ` Richard Palethorpe
  2020-10-20 13:49               ` Richard Palethorpe
@ 2020-10-20 17:24               ` Michal Koutný
  2020-10-22  7:04                 ` Richard Palethorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2020-10-20 17:24 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Shakeel Butt, Linux MM, LKML, LTP List, Roman Gushchin,
	Johannes Weiner, Andrew Morton, Christoph Lameter, Michal Hocko,
	Tejun Heo, Vlastimil Babka


[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

Hi.

On Tue, Oct 20, 2020 at 06:52:08AM +0100, Richard Palethorpe <rpalethorpe@suse.de> wrote:
> I don't think that is relevant as we get the memcg from objcg->memcg
> which is set during reparenting. I suppose however, we can determine if
> the objcg was reparented by inspecting memcg->objcg.
+1

> If we just check use_hierarchy then objects directly charged to the
> memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
> better to check if it was reparented and if use_hierarchy=0.
I think (I had to make a table) the yielded condition would be:

if ((memcg->use_hierarchy && reparented) || (!mem_cgroup_is_root(memcg) && !reparented))
	 __memcg_kmem_uncharge(memcg, nr_pages);

(I admit it's not very readable.)


Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 17:07           ` Roman Gushchin
@ 2020-10-20 18:18             ` Johannes Weiner
  2020-10-21 19:33               ` Roman Gushchin
  0 siblings, 1 reply; 45+ messages in thread
From: Johannes Weiner @ 2020-10-20 18:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> If we want these counter to function properly, then we should go into the opposite
> direction and remove the special handling of the root memory cgroup in many places.

I suspect this is also by far the most robust solution from a code and
maintenance POV.

I don't recall the page counter at the root level having been a
concern in recent years, even though it's widely used in production
environments. It's lockless and cache compact. It's also per-cpu
batched, which means it isn't actually part of the memcg hotpath.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 18:18             ` Johannes Weiner
@ 2020-10-21 19:33               ` Roman Gushchin
  2020-10-23 16:30                 ` Johannes Weiner
  2020-11-03 13:22                 ` Michal Hocko
  0 siblings, 2 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-21 19:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > If we want these counter to function properly, then we should go into the opposite
> > direction and remove the special handling of the root memory cgroup in many places.
> 
> I suspect this is also by far the most robust solution from a code and
> maintenance POV.
> 
> I don't recall the page counter at the root level having been a
> concern in recent years, even though it's widely used in production
> environments. It's lockless and cache compact. It's also per-cpu
> batched, which means it isn't actually part of the memcg hotpath.


I agree.

Here is my first attempt. Comments are welcome!

It doesn't solve the original problem though (use_hierarchy == false and
objcg reparenting), I'll send a separate patch for that.

Thanks!

--

From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Tue, 20 Oct 2020 18:05:43 -0700
Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
 specially

Currently the root memory cgroup is treated in a special way:
it's not charged and uncharged directly (only indirectly with their
descendants), processes belonging to the root memory cgroup are exempt
from the kernel- and the socket memory accounting.

At the same time some of root level statistics and data are available
to a user:
  - cgroup v2: memory.stat
  - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
               memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes

Historically the reason for a special treatment was an avoidance
of extra performance cost, however now it's unlikely a good reason:
over years there was a significant improvement in the performance
of the memory cgroup code. Also on a modern system actively using
cgroups (e.g. managed by systemd) there are usually no (significant)
processes in the root memory cgroup.

The special treatment of the root memory cgroups creates a number of
issues visible to a user:
1) slab stats on the root level do not include the slab memory
   consumed by processes in the root memory cgroup
2) non-slab kernel memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.usage_in_bytes
3) socket memory consumed by processes in the root memory cgroup
   is not included into memory.kmem.tcp.usage_in_bytes

It complicates the code and increases a risk of new bugs.

This patch removes a number of exceptions related to the handling of
the root memory cgroup. With this patch applied the root memory cgroup
is treated uniformly to other cgroups in the following cases:
1) root memory cgroup is charged and uncharged directly, try_charge()
   and cancel_charge() do not return immediately if the root memory
   cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
   do not handle the root memory cgroup specially.
2) per-memcg slab statistics is gathered for the root memory cgroup
3) shrinkers infra treats the root memory cgroup as any other memory
   cgroup
4) non-slab kernel memory accounting doesn't exclude pages allocated
   by processes belonging to the root memory cgroup
5) if a socket is opened by a process in the root memory cgroup,
   the socket memory is accounted
6) root cgroup is charged for the used swap memory.

Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  3 +-
 mm/memcontrol.c            | 82 ++++++++++++++------------------------
 mm/vmscan.c                |  9 +----
 3 files changed, 31 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..d3653eb5d1b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,8 +416,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 {
 	/*
-	 * The root memcg doesn't account charges, and doesn't support
-	 * protection.
+	 * The root memcg doesn't support memory protection.
 	 */
 	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..a8bdca0f58f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -438,9 +438,6 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	struct memcg_shrinker_map *map;
 	int nid;
 
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	for_each_node(nid) {
 		pn = mem_cgroup_nodeinfo(memcg, nid);
 		map = rcu_dereference_protected(pn->shrinker_map, true);
@@ -455,9 +452,6 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	struct memcg_shrinker_map *map;
 	int nid, size, ret = 0;
 
-	if (mem_cgroup_is_root(memcg))
-		return 0;
-
 	mutex_lock(&memcg_shrinker_map_mutex);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
@@ -489,8 +483,6 @@ int memcg_expand_shrinker_maps(int new_id)
 		goto unlock;
 
 	for_each_mem_cgroup(memcg) {
-		if (mem_cgroup_is_root(memcg))
-			continue;
 		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
@@ -506,7 +498,7 @@ int memcg_expand_shrinker_maps(int new_id)
 
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
-	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+	if (shrinker_id >= 0 && memcg) {
 		struct memcg_shrinker_map *map;
 
 		rcu_read_lock();
@@ -868,7 +860,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 	memcg = mem_cgroup_from_obj(p);
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
-	if (!memcg || memcg == root_mem_cgroup) {
+	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
 	} else {
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -2439,8 +2431,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							     gfp_mask, true);
 		psi_memstall_leave(&pflags);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return nr_reclaimed;
 }
@@ -2532,8 +2523,7 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 		overage = calculate_overage(page_counter_read(&memcg->memory),
 					    READ_ONCE(memcg->memory.high));
 		max_overage = max(overage, max_overage);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return max_overage;
 }
@@ -2548,8 +2538,7 @@ static u64 swap_find_max_overage(struct mem_cgroup *memcg)
 		if (overage)
 			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
 		max_overage = max(overage, max_overage);
-	} while ((memcg = parent_mem_cgroup(memcg)) &&
-		 !mem_cgroup_is_root(memcg));
+	} while ((memcg = parent_mem_cgroup(memcg)));
 
 	return max_overage;
 }
@@ -2686,8 +2675,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool drained = false;
 	unsigned long pflags;
 
-	if (mem_cgroup_is_root(memcg))
-		return 0;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
@@ -2873,9 +2860,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 #if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -2978,7 +2962,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 	else
 		memcg = mem_cgroup_from_task(current);
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		objcg = rcu_dereference(memcg->objcg);
 		if (objcg && obj_cgroup_tryget(objcg))
 			break;
@@ -3096,15 +3080,16 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	int ret = 0;
 
 	memcg = get_mem_cgroup_from_current();
-	if (memcg && !mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
-		if (!ret) {
-			page->mem_cgroup = memcg;
-			__SetPageKmemcg(page);
-			return 0;
-		}
-		css_put(&memcg->css);
+	if (!memcg)
+		return 0;
+
+	ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+	if (!ret) {
+		page->mem_cgroup = memcg;
+		__SetPageKmemcg(page);
+		return 0;
 	}
+	css_put(&memcg->css);
 	return ret;
 }
 
@@ -3121,7 +3106,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (!memcg)
 		return;
 
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 	css_put(&memcg->css);
@@ -5913,8 +5897,7 @@ static void __mem_cgroup_clear_mc(void)
 	/* we must fixup refcnts and charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
-		if (!mem_cgroup_is_root(mc.from))
-			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
+		page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
 		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
 
@@ -5922,8 +5905,7 @@ static void __mem_cgroup_clear_mc(void)
 		 * we charged both to->memory and to->memsw, so we
 		 * should uncharge to->memory.
 		 */
-		if (!mem_cgroup_is_root(mc.to))
-			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+		page_counter_uncharge(&mc.to->memory, mc.moved_swap);
 
 		mc.moved_swap = 0;
 	}
@@ -6824,14 +6806,12 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(ug->memcg)) {
-		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
-		if (do_memsw_account())
-			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
-	}
+	page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+		page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+	memcg_oom_recover(ug->memcg);
 
 	local_irq_save(flags);
 	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
@@ -7013,8 +6993,6 @@ void mem_cgroup_sk_alloc(struct sock *sk)
 
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
-	if (memcg == root_mem_cgroup)
-		goto out;
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
 		goto out;
 	if (css_tryget(&memcg->css))
@@ -7195,12 +7173,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	page->mem_cgroup = NULL;
 
-	if (!mem_cgroup_is_root(memcg))
-		page_counter_uncharge(&memcg->memory, nr_entries);
+	page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (!cgroup_memory_noswap && memcg != swap_memcg) {
-		if (!mem_cgroup_is_root(swap_memcg))
-			page_counter_charge(&swap_memcg->memsw, nr_entries);
+		page_counter_charge(&swap_memcg->memsw, nr_entries);
 		page_counter_uncharge(&memcg->memsw, nr_entries);
 	}
 
@@ -7249,7 +7225,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 
 	memcg = mem_cgroup_id_get_online(memcg);
 
-	if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+	if (!cgroup_memory_noswap &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7281,7 +7257,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
-		if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
+		if (!cgroup_memory_noswap) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
@@ -7299,7 +7275,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 
 	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
+	for (; memcg; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
 				      READ_ONCE(memcg->swap.max) -
 				      page_counter_read(&memcg->swap));
@@ -7321,7 +7297,7 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		unsigned long usage = page_counter_read(&memcg->swap);
 
 		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d848c76e035a..fb6b3cbe0764 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -651,14 +651,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
 
-	/*
-	 * The root memcg might be allocated even though memcg is disabled
-	 * via "cgroup_disable=memory" boot parameter.  This could make
-	 * mem_cgroup_is_root() return false, then just run memcg slab
-	 * shrink, but skip global shrink.  This may result in premature
-	 * oom.
-	 */
-	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_disabled())
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
 	if (!down_read_trylock(&shrinker_rwsem))
-- 
2.26.2



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 16:56                 ` Shakeel Butt
@ 2020-10-21 20:32                   ` Roman Gushchin
  0 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-21 20:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: rpalethorpe, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Tue, Oct 20, 2020 at 09:56:51AM -0700, Shakeel Butt wrote:
> On Tue, Oct 20, 2020 at 6:49 AM Richard Palethorpe <rpalethorpe@suse.de> wrote:
> >
> > Hello,
> >
> > Richard Palethorpe <rpalethorpe@suse.de> writes:
> >
> > > Hello Shakeel,
> > >
> > > Shakeel Butt <shakeelb@google.com> writes:
> > >>>
> > >>> V3: Handle common case where use_hierarchy=1 and update description.
> > >>>
> > >>>  mm/memcontrol.c | 7 +++++--
> > >>>  1 file changed, 5 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > >>> index 6877c765b8d0..34b8c4a66853 100644
> > >>> --- a/mm/memcontrol.c
> > >>> +++ b/mm/memcontrol.c
> > >>> @@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> > >>>
> > >>>         spin_lock_irqsave(&css_set_lock, flags);
> > >>>         memcg = obj_cgroup_memcg(objcg);
> > >>> -       if (nr_pages)
> > >>> +       if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
> > >>
> > >> If we have non-root memcg with use_hierarchy as 0 and this objcg was
> > >> reparented then this __memcg_kmem_uncharge() can potentially underflow
> > >> the page counter and give the same warning.
> > >
> > > Yes, although the kernel considers such a config to be broken, and
> > > prints a warning to the log, it does allow it.
> >
> > Actually this can not happen because if use_hierarchy=0 then the objcg
> > will be reparented to root.
> >
> 
> Yup, you are right. I do wonder if we should completely deprecate
> use_hierarchy=0.

+1

Until that happy time maybe we can just link all page counters
to root page counters if use_hierarchy == false?
That would solve the original problem without complicating the code
in the main use_hierarchy == true mode.

Are there any bad consequences, which I miss?

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..fbbc74b82e1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
                memcg->swappiness = mem_cgroup_swappiness(parent);
                memcg->oom_kill_disable = parent->oom_kill_disable;
        }
-       if (parent && parent->use_hierarchy) {
+       if (!parent) {
+               page_counter_init(&memcg->memory, NULL);
+               page_counter_init(&memcg->swap, NULL);
+               page_counter_init(&memcg->kmem, NULL);
+               page_counter_init(&memcg->tcpmem, NULL);
+       } else if (parent->use_hierarchy) {
                memcg->use_hierarchy = true;
                page_counter_init(&memcg->memory, &parent->memory);
                page_counter_init(&memcg->swap, &parent->swap);
                page_counter_init(&memcg->kmem, &parent->kmem);
                page_counter_init(&memcg->tcpmem, &parent->tcpmem);
        } else {
-               page_counter_init(&memcg->memory, NULL);
-               page_counter_init(&memcg->swap, NULL);
-               page_counter_init(&memcg->kmem, NULL);
-               page_counter_init(&memcg->tcpmem, NULL);
+               page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+               page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+               page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+               page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
                /*
                 * Deeper hierachy with use_hierarchy == false doesn't make
                 * much sense so let cgroup subsystem know about this



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-20 17:24               ` Michal Koutný
@ 2020-10-22  7:04                 ` Richard Palethorpe
  2020-10-22 12:28                   ` [PATCH v4] " Richard Palethorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-22  7:04 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Shakeel Butt, Linux MM, LKML, LTP List, Roman Gushchin,
	Johannes Weiner, Andrew Morton, Christoph Lameter, Michal Hocko,
	Tejun Heo, Vlastimil Babka

Hello,

Michal Koutný <mkoutny@suse.com> writes:

> Hi.
>
> On Tue, Oct 20, 2020 at 06:52:08AM +0100, Richard Palethorpe <rpalethorpe@suse.de> wrote:
>> I don't think that is relevant as we get the memcg from objcg->memcg
>> which is set during reparenting. I suppose however, we can determine if
>> the objcg was reparented by inspecting memcg->objcg.
> +1
>
>> If we just check use_hierarchy then objects directly charged to the
>> memcg where use_hierarchy=0 will not be uncharged. However, maybe it is
>> better to check if it was reparented and if use_hierarchy=0.
> I think (I had to make a table) the yielded condition would be:
>
> if ((memcg->use_hierarchy && reparented) || (!mem_cgroup_is_root(memcg) && !reparented))

This looks correct, but I don't think we need to check for reparenting
after all. We have the following unique scenarious:

use_hierarchy=1, memcg=?, reparented=?:
Because use_hierarchy=1 any descendants will have charged the current
memcg, including root, so we can simply uncharge regardless of the memcg
or objcg.

use_hierarchy=0, memcg!=root, reparented=?:
When use_hierarchy=0, objcgs are *only* reparented to root, so if we are
not root the objcg must belong to us.

use_hierarchy=0, memcg=root, reparented=?:
We must have been reparented because root is not initialised with an
objcg, but use_hierarchy=0 so we can not uncharge.

So I believe that the following is sufficient.

if (memcg->use_hierarchy || !mem_cgroup_is_root(memcg))
> 	 __memcg_kmem_uncharge(memcg, nr_pages);
>
> (I admit it's not very readable.)
>
>
> Michal

For the record, I did create the following patch which checks for
reparenting, but it appears to be unecessary.

----

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..0285f760f003 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 #ifdef CONFIG_MEMCG_KMEM
 extern spinlock_t css_set_lock;

+/* Assumes objcg originated from a descendant of memcg or is memcg's */
+static bool obj_cgroup_did_charge(const struct obj_cgroup *objcg,
+                                 const struct mem_cgroup *memcg)
+{
+       return memcg->use_hierarchy ||
+               rcu_access_pointer(memcg->objcg) == objcg;
+}
+
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
        struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
@@ -291,7 +299,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)

        spin_lock_irqsave(&css_set_lock, flags);
        memcg = obj_cgroup_memcg(objcg);
-       if (nr_pages)
+       if (nr_pages && obj_cgroup_did_charge(objcg, memcg))
                __memcg_kmem_uncharge(memcg, nr_pages);
        list_del(&objcg->list);
        mem_cgroup_put(memcg);
@@ -3100,6 +3108,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
        struct obj_cgroup *old = stock->cached_objcg;
+       struct mem_cgroup *memcg;

        if (!old)
                return;
@@ -3110,7 +3119,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)

                if (nr_pages) {
                        rcu_read_lock();
-                       __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+                       memcg = obj_cgroup_memcg(old);
+                       if (obj_cgroup_did_charge(old, memcg))
+                               __memcg_kmem_uncharge(memcg, nr_pages);
                        rcu_read_unlock();
                }

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-22  7:04                 ` Richard Palethorpe
@ 2020-10-22 12:28                   ` Richard Palethorpe
  2020-10-22 16:37                     ` Shakeel Butt
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-22 12:28 UTC (permalink / raw)
  To: Linux MM, LKML
  Cc: Richard Palethorpe, ltp, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Shakeel Butt, Christoph Lameter, Michal Hocko,
	Tejun Heo, Vlastimil Babka

When use_hierarchy=1, SLAB objects which outlive their descendant
memcg are moved to their parent memcg where they may be uncharged
because charges are made recursively from leaf to root nodes.

However when use_hierarchy=0, they are reparented directly to root and
charging is not made recursively. Therefor uncharging will result in a
counter underflow on the root memcg, but no other ancestors.

To prevent this, we check whether we are about to uncharge the root
memcg and whether use_hierarchy=0. If this is the case then we skip
uncharging. The root memcg does not have its own objcg, so any objcg
which is uncharging to it must have been reparented.

Note that on the default hierarchy (CGroupV2 now) root always has
use_hierarchy=1. So this only effects CGroupV1.

The warning can be, unreliably, reproduced with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
is present. Although the listed commit in 'fixes' appears to introduce
the bug, I can not reproduce it with just that commit and bisecting
runs into other bugs.

[   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[   12.029539] Modules linked in:
[   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[   12.031209] Call Trace:
[   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[   12.033059] kthread (kernel/kthread.c:292)
[   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[   12.033357] ---[ end trace 961dbfc01c109d1f ]---

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

Reported-by: ltp@lists.linux.it
Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---

V4: Same as V3, but with hopefully better/correct commit message.

 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..34b8c4a66853 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	memcg = obj_cgroup_memcg(objcg);
-	if (nr_pages)
+	if (nr_pages && (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy))
 		__memcg_kmem_uncharge(memcg, nr_pages);
 	list_del(&objcg->list);
 	mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
+	struct mem_cgroup *memcg;
 
 	if (!old)
 		return;
@@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 
 		if (nr_pages) {
 			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			memcg = obj_cgroup_memcg(old);
+			if (!mem_cgroup_is_root(memcg) || memcg->use_hierarchy)
+				__memcg_kmem_uncharge(memcg, nr_pages);
 			rcu_read_unlock();
 		}
 
-- 
2.28.0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-22 12:28                   ` [PATCH v4] " Richard Palethorpe
@ 2020-10-22 16:37                     ` Shakeel Butt
  2020-10-22 17:25                       ` Roman Gushchin
  0 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2020-10-22 16:37 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Linux MM, LKML, LTP List, Roman Gushchin, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 5:29 AM Richard Palethorpe <rpalethorpe@suse.com> wrote:
>
> When use_hierarchy=1, SLAB objects which outlive their descendant
> memcg are moved to their parent memcg where they may be uncharged
> because charges are made recursively from leaf to root nodes.
>
> However when use_hierarchy=0, they are reparented directly to root and
> charging is not made recursively. Therefor uncharging will result in a
> counter underflow on the root memcg, but no other ancestors.
>
> To prevent this, we check whether we are about to uncharge the root
> memcg and whether use_hierarchy=0. If this is the case then we skip
> uncharging. The root memcg does not have its own objcg, so any objcg
> which is uncharging to it must have been reparented.
>
> Note that on the default hierarchy (CGroupV2 now) root always has
> use_hierarchy=1. So this only effects CGroupV1.
>
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
>
> [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [   12.029539] Modules linked in:
> [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [   12.031209] Call Trace:
> [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> [   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
> [   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
> [   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
> [   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
> [   12.033059] kthread (kernel/kthread.c:292)
> [   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
> [   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [   12.033357] ---[ end trace 961dbfc01c109d1f ]---
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> Reported-by: ltp@lists.linux.it
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")

This looks fine to me but I would prefer Roman's approach of charging
to root and for use_hierarchy=0, linking them to root.

Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
for 5.11.

What does everyone think?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-22 16:37                     ` Shakeel Butt
@ 2020-10-22 17:25                       ` Roman Gushchin
  2020-10-22 23:59                         ` Shakeel Butt
  0 siblings, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2020-10-22 17:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Richard Palethorpe, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 09:37:02AM -0700, Shakeel Butt wrote:
> On Thu, Oct 22, 2020 at 5:29 AM Richard Palethorpe <rpalethorpe@suse.com> wrote:
> >
> > When use_hierarchy=1, SLAB objects which outlive their descendant
> > memcg are moved to their parent memcg where they may be uncharged
> > because charges are made recursively from leaf to root nodes.
> >
> > However when use_hierarchy=0, they are reparented directly to root and
> > charging is not made recursively. Therefor uncharging will result in a
> > counter underflow on the root memcg, but no other ancestors.
> >
> > To prevent this, we check whether we are about to uncharge the root
> > memcg and whether use_hierarchy=0. If this is the case then we skip
> > uncharging. The root memcg does not have its own objcg, so any objcg
> > which is uncharging to it must have been reparented.
> >
> > Note that on the default hierarchy (CGroupV2 now) root always has
> > use_hierarchy=1. So this only effects CGroupV1.
> >
> > The warning can be, unreliably, reproduced with the LTP test
> > madvise06 if the entire patch series
> > https://lore.kernel.org/linux-mm/20200623174037.3951353-1-guro@fb.com/
> > is present. Although the listed commit in 'fixes' appears to introduce
> > the bug, I can not reproduce it with just that commit and bisecting
> > runs into other bugs.
> >
> > [   12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [   12.029539] Modules linked in:
> > [   12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> > [   12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [   12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [   12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> > [   12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> > [   12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> > [   12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> > [   12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> > [   12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> > [   12.030895] FS:  0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> > [   12.031017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> > [   12.031209] Call Trace:
> > [   12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [   12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [   12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> > [   12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> > [   12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> > [   12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> > [   12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> > [   12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> > [   12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
> > [   12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
> > [   12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
> > [   12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
> > [   12.033059] kthread (kernel/kthread.c:292)
> > [   12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [   12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [   12.033357] ---[ end trace 961dbfc01c109d1f ]---
> >
> > [    9.841552] ------------[ cut here ]------------
> > [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [    9.841982] Modules linked in:
> > [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> > [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [    9.842571] Workqueue: events drain_local_stock
> > [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> > [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> > [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> > [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> > [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> > [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> > [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> > [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> > [    9.845319] Call Trace:
> > [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> > [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> > [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> > [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> > [    9.846162] kthread (kernel/kthread.c:292)
> > [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> >
> > Reported-by: ltp@lists.linux.it
> > Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> 
> This looks fine to me but I would prefer Roman's approach of charging
> to root and for use_hierarchy=0, linking them to root.

+1

I think it's safer because it has no effect on use_hierarchy == true case,
and this is what we should really care about. But the latest Richard's version
looks correct to me, so I'm fine with using it for stable too.

> 
> Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> for 5.11.
> 
> What does everyone think?

I think we should use the link to the root approach both for stable backports
and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
is not directly related to this problem, and we can keep it for 5.11+ only.

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-22 17:25                       ` Roman Gushchin
@ 2020-10-22 23:59                         ` Shakeel Butt
  2020-10-23  0:40                           ` Roman Gushchin
  0 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2020-10-22 23:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Richard Palethorpe, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
>
[snip]
> >
> > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > for 5.11.
> >
> > What does everyone think?
>
> I think we should use the link to the root approach both for stable backports
> and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> is not directly related to this problem, and we can keep it for 5.11+ only.
>
> Thanks!

Roman, can you send the signed-off patch for the root linking for
use_hierarchy=0?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-22 23:59                         ` Shakeel Butt
@ 2020-10-23  0:40                           ` Roman Gushchin
  2020-10-23 15:44                             ` Johannes Weiner
                                               ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-23  0:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Richard Palethorpe, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> >
> [snip]
> > >
> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > > for 5.11.
> > >
> > > What does everyone think?
> >
> > I think we should use the link to the root approach both for stable backports
> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> > is not directly related to this problem, and we can keep it for 5.11+ only.
> >
> > Thanks!
> 
> Roman, can you send the signed-off patch for the root linking for
> use_hierarchy=0?

Sure, here we are.

Thanks!

--

From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Thu, 22 Oct 2020 17:12:32 -0700
Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false

Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

[    9.841552] ------------[ cut here ]------------
[    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[    9.841982] Modules linked in:
[    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[    9.842571] Workqueue: events drain_local_stock
[    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[    9.845319] Call Trace:
[    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[    9.845684] drain_local_stock (mm/memcontrol.c:2255)
[    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[    9.846034] ? process_one_work (kernel/workqueue.c:2358)
[    9.846162] kthread (kernel/kthread.c:292)
[    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[    9.846531] ---[ end trace 8b5647c1eba9d18a ]---

The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to root page counters, so the charge is not
propagated to the root memory cgroup.

After the removal of the original memory cgroup and reparenting of the
object cgroup, the root cgroup might be uncharged by draining a objcg
stock, for example. It leads to an eventual underflow of the charge
and triggers a warning.

Fix it by linking all page counters to corresponding root page
counters in the non-hierarchical mode.

The patch doesn't affect how the hierarchical mode is working,
which is the only sane and truly supported mode now.

Thanks to Richard for reporting, debugging and providing an
alternative version of the fix!

Reported-by: ltp@lists.linux.it
Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
---
 mm/memcontrol.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..009297017c87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
 	}
-	if (parent && parent->use_hierarchy) {
+	if (!parent) {
+		page_counter_init(&memcg->memory, NULL);
+		page_counter_init(&memcg->swap, NULL);
+		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->tcpmem, NULL);
+	} else if (parent->use_hierarchy) {
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
-		page_counter_init(&memcg->memory, NULL);
-		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->kmem, NULL);
-		page_counter_init(&memcg->tcpmem, NULL);
+		page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+		page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+		page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+		page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-23  0:40                           ` Roman Gushchin
@ 2020-10-23 15:44                             ` Johannes Weiner
  2020-10-23 16:41                             ` Shakeel Butt
  2020-10-26  7:32                             ` Richard Palethorpe
  2 siblings, 0 replies; 45+ messages in thread
From: Johannes Weiner @ 2020-10-23 15:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Richard Palethorpe, Linux MM, LKML, LTP List,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 05:40:26PM -0700, Roman Gushchin wrote:
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
> 
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
> 
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> 
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
> 
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg
> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
> 
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
> 
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
> 
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
> 
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-21 19:33               ` Roman Gushchin
@ 2020-10-23 16:30                 ` Johannes Weiner
  2020-11-10  1:27                   ` Roman Gushchin
  2020-11-03 13:22                 ` Michal Hocko
  1 sibling, 1 reply; 45+ messages in thread
From: Johannes Weiner @ 2020-10-23 16:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > If we want these counter to function properly, then we should go into the opposite
> > > direction and remove the special handling of the root memory cgroup in many places.
> > 
> > I suspect this is also by far the most robust solution from a code and
> > maintenance POV.
> > 
> > I don't recall the page counter at the root level having been a
> > concern in recent years, even though it's widely used in production
> > environments. It's lockless and cache compact. It's also per-cpu
> > batched, which means it isn't actually part of the memcg hotpath.
> 
> 
> I agree.
> 
> Here is my first attempt. Comments are welcome!
> 
> It doesn't solve the original problem though (use_hierarchy == false and
> objcg reparenting), I'll send a separate patch for that.
> 
> Thanks!
> 
> --
> 
> From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Tue, 20 Oct 2020 18:05:43 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
>  specially
> 
> Currently the root memory cgroup is treated in a special way:
> it's not charged and uncharged directly (only indirectly with their
> descendants), processes belonging to the root memory cgroup are exempt
> from the kernel- and the socket memory accounting.
> 
> At the same time some of root level statistics and data are available
> to a user:
>   - cgroup v2: memory.stat
>   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
>                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> 
> Historically the reason for a special treatment was an avoidance
> of extra performance cost, however now it's unlikely a good reason:
> over years there was a significant improvement in the performance
> of the memory cgroup code. Also on a modern system actively using
> cgroups (e.g. managed by systemd) there are usually no (significant)
> processes in the root memory cgroup.
> 
> The special treatment of the root memory cgroups creates a number of
> issues visible to a user:
> 1) slab stats on the root level do not include the slab memory
>    consumed by processes in the root memory cgroup
> 2) non-slab kernel memory consumed by processes in the root memory cgroup
>    is not included into memory.kmem.usage_in_bytes
> 3) socket memory consumed by processes in the root memory cgroup
>    is not included into memory.kmem.tcp.usage_in_bytes
> 
> It complicates the code and increases a risk of new bugs.
> 
> This patch removes a number of exceptions related to the handling of
> the root memory cgroup. With this patch applied the root memory cgroup
> is treated uniformly to other cgroups in the following cases:
> 1) root memory cgroup is charged and uncharged directly, try_charge()
>    and cancel_charge() do not return immediately if the root memory
>    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
>    do not handle the root memory cgroup specially.
> 2) per-memcg slab statistics is gathered for the root memory cgroup
> 3) shrinkers infra treats the root memory cgroup as any other memory
>    cgroup
> 4) non-slab kernel memory accounting doesn't exclude pages allocated
>    by processes belonging to the root memory cgroup
> 5) if a socket is opened by a process in the root memory cgroup,
>    the socket memory is accounted
> 6) root cgroup is charged for the used swap memory.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

This looks great.

The try_charge(), cancel_charge() etc. paths are relatively
straight-forward and look correct to me.

The swap counters too.

Slab is a bit trickier, but it also looks correct to me.

I'm having some trouble with the shrinkers. Currently, tracked objects
allocated in non-root cgroups live in that cgroup. Tracked objects in
the root cgroup, as well as untracked objects, live in a global pool.
When reclaim iterates all memcgs and calls shrink_slab(), we special
case the root_mem_cgroup and redirect to the global pool.

After your patch we have tracked objects allocated in the root cgroup
actually live in the root cgroup. Removing the shrinker special case
is correct in order to shrink those - but it removes the call to
shrink the global pool of untracked allocation classes.

I think we need to restore the double call to shrink_slab() we had
prior to this:

commit aeed1d325d429ac9699c4bf62d17156d60905519
Author: Vladimir Davydov <vdavydov.dev@gmail.com>
Date:   Fri Aug 17 15:48:17 2018 -0700

    mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
    
    The patch makes shrink_slab() be called for root_mem_cgroup in the same
    way as it's called for the rest of cgroups.  This simplifies the logic
    and improves the readability.

where we iterate through all cgroups, including the root, to reclaim
objects accounted to those respective groups; and then a call to scan
the global pool of untracked objects in that numa node.

For ease of review/verification, it could be helpful to split the
patch and remove the root exception case-by-case (not callsite by
callsite, but e.g. the swap counter, the memory counter etc.).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-23  0:40                           ` Roman Gushchin
  2020-10-23 15:44                             ` Johannes Weiner
@ 2020-10-23 16:41                             ` Shakeel Butt
  2020-10-26  7:32                             ` Richard Palethorpe
  2 siblings, 0 replies; 45+ messages in thread
From: Shakeel Butt @ 2020-10-23 16:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Richard Palethorpe, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Thu, Oct 22, 2020 at 5:40 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> > On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > [snip]
> > > >
> > > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> > > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> > > > for 5.11.
> > > >
> > > > What does everyone think?
> > >
> > > I think we should use the link to the root approach both for stable backports
> > > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> > > is not directly related to this problem, and we can keep it for 5.11+ only.
> > >
> > > Thanks!
> >
> > Roman, can you send the signed-off patch for the root linking for
> > use_hierarchy=0?
>
> Sure, here we are.
>
> Thanks!
>
> --
>
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
>
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
>
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg
> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
>
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
>
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
>
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
>
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-23  0:40                           ` Roman Gushchin
  2020-10-23 15:44                             ` Johannes Weiner
  2020-10-23 16:41                             ` Shakeel Butt
@ 2020-10-26  7:32                             ` Richard Palethorpe
  2020-10-26 23:14                               ` Roman Gushchin
  2 siblings, 1 reply; 45+ messages in thread
From: Richard Palethorpe @ 2020-10-26  7:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

Hello Roman,

Roman Gushchin <guro@fb.com> writes:

> On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
>> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
>> >
>> [snip]
>> > >
>> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
>> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
>> > > for 5.11.
>> > >
>> > > What does everyone think?
>> >
>> > I think we should use the link to the root approach both for stable backports
>> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
>> > is not directly related to this problem, and we can keep it for 5.11+ only.
>> >
>> > Thanks!
>> 
>> Roman, can you send the signed-off patch for the root linking for
>> use_hierarchy=0?
>
> Sure, here we are.
>
> Thanks!
>
> --
>
> From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <guro@fb.com>
> Date: Thu, 22 Oct 2020 17:12:32 -0700
> Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
>
> Richard reported a warning which can be reproduced by running the LTP
> madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
>
> [    9.841552] ------------[ cut here ]------------
> [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [    9.841982] Modules linked in:
> [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [    9.842571] Workqueue: events drain_local_stock
> [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> [    9.845319] Call Trace:
> [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> [    9.846162] kthread (kernel/kthread.c:292)
> [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
>
> The problem occurs because in the non-hierarchical mode non-root page
> counters are not linked to root page counters, so the charge is not
> propagated to the root memory cgroup.
>
> After the removal of the original memory cgroup and reparenting of the
> object cgroup, the root cgroup might be uncharged by draining a objcg

I think it is worth mentioning that reparenting will always be to root
to avoid any confusion about what may happen with deeper, broken,
hierarchies.

> stock, for example. It leads to an eventual underflow of the charge
> and triggers a warning.
>
> Fix it by linking all page counters to corresponding root page
> counters in the non-hierarchical mode.
>
> The patch doesn't affect how the hierarchical mode is working,
> which is the only sane and truly supported mode now.
>
> Thanks to Richard for reporting, debugging and providing an
> alternative version of the fix!
>
> Reported-by: ltp@lists.linux.it
> Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>

Much appreciated, thanks!

-- 
Thank you,
Richard.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-26  7:32                             ` Richard Palethorpe
@ 2020-10-26 23:14                               ` Roman Gushchin
  0 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-10-26 23:14 UTC (permalink / raw)
  To: Richard Palethorpe
  Cc: Shakeel Butt, Linux MM, LKML, LTP List, Johannes Weiner,
	Andrew Morton, Christoph Lameter, Michal Hocko, Tejun Heo,
	Vlastimil Babka

On Mon, Oct 26, 2020 at 07:32:39AM +0000, Richard Palethorpe wrote:
> Hello Roman,
> 
> Roman Gushchin <guro@fb.com> writes:
> 
> > On Thu, Oct 22, 2020 at 04:59:56PM -0700, Shakeel Butt wrote:
> >> On Thu, Oct 22, 2020 at 10:25 AM Roman Gushchin <guro@fb.com> wrote:
> >> >
> >> [snip]
> >> > >
> >> > > Since bf4f059954dc ("mm: memcg/slab: obj_cgroup API") is in 5.9, I
> >> > > think we can take this patch for 5.9 and 5.10 but keep Roman's cleanup
> >> > > for 5.11.
> >> > >
> >> > > What does everyone think?
> >> >
> >> > I think we should use the link to the root approach both for stable backports
> >> > and for 5.11+, to keep them in sync. The cleanup (always charging the root cgroup)
> >> > is not directly related to this problem, and we can keep it for 5.11+ only.
> >> >
> >> > Thanks!
> >> 
> >> Roman, can you send the signed-off patch for the root linking for
> >> use_hierarchy=0?
> >
> > Sure, here we are.
> >
> > Thanks!
> >
> > --
> >
> > From 19d66695f0ef1bf1ef7c51073ab91d67daa91362 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Thu, 22 Oct 2020 17:12:32 -0700
> > Subject: [PATCH] mm: memcg: link page counters to root if use_hierarchy is false
> >
> > Richard reported a warning which can be reproduced by running the LTP
> > madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
> >
> > [    9.841552] ------------[ cut here ]------------
> > [    9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [    9.841982] Modules linked in:
> > [    9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
> > [    9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> > [    9.842571] Workqueue: events drain_local_stock
> > [    9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> > [ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
> > [    9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
> > [    9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
> > [    9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
> > [    9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
> > [    9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
> > [    9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
> > [    9.844696] FS:  0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
> > [    9.844915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
> > [    9.845319] Call Trace:
> > [    9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> > [    9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> > [    9.845684] drain_local_stock (mm/memcontrol.c:2255)
> > [    9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
> > [    9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
> > [    9.846034] ? process_one_work (kernel/workqueue.c:2358)
> > [    9.846162] kthread (kernel/kthread.c:292)
> > [    9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
> > [    9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
> > [    9.846531] ---[ end trace 8b5647c1eba9d18a ]---
> >
> > The problem occurs because in the non-hierarchical mode non-root page
> > counters are not linked to root page counters, so the charge is not
> > propagated to the root memory cgroup.
> >
> > After the removal of the original memory cgroup and reparenting of the
> > object cgroup, the root cgroup might be uncharged by draining a objcg
> 
> I think it is worth mentioning that reparenting will always be to root
> to avoid any confusion about what may happen with deeper, broken,
> hierarchies.

I agree. Added and sent v2.

> 
> > stock, for example. It leads to an eventual underflow of the charge
> > and triggers a warning.
> >
> > Fix it by linking all page counters to corresponding root page
> > counters in the non-hierarchical mode.
> >
> > The patch doesn't affect how the hierarchical mode is working,
> > which is the only sane and truly supported mode now.
> >
> > Thanks to Richard for reporting, debugging and providing an
> > alternative version of the fix!
> >
> > Reported-by: ltp@lists.linux.it
> > Debugged-by: Richard Palethorpe <rpalethorpe@suse.com>
> 
> Much appreciated, thanks!

You did most of the work. Thank you!

Roman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-21 19:33               ` Roman Gushchin
  2020-10-23 16:30                 ` Johannes Weiner
@ 2020-11-03 13:22                 ` Michal Hocko
  2020-11-03 21:30                   ` Roman Gushchin
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2020-11-03 13:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Tejun Heo, Vlastimil Babka, linux-mm,
	linux-kernel

Hi,
I am sorry but I was quite busy at the time this has been discussed.
Now that I got to this thread finally I am wondering whether this
resp. other root cgroup exceptions have any follow up.

I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
memcg: link page counters to root if use_hierarchy is false"). Thanks
for taking care of that! I do agree that this approach is the most
reasonable immediate fix.

Btw. we have been carrying a warning about non hierarchical hierarchies
for years in our distribution kernels and haven't heard of any actual
bug reports (except for LTP driven ones). So we might be really close to
simply drop this functionality completely. This would simplify the code
and prevent from future surprises.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-11-03 13:22                 ` Michal Hocko
@ 2020-11-03 21:30                   ` Roman Gushchin
  0 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-11-03 21:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Michal Koutný,
	Richard Palethorpe, ltp, Andrew Morton, Shakeel Butt,
	Christoph Lameter, Tejun Heo, Vlastimil Babka, linux-mm,
	linux-kernel

On Tue, Nov 03, 2020 at 02:22:21PM +0100, Michal Hocko wrote:
> Hi,
> I am sorry but I was quite busy at the time this has been discussed.
> Now that I got to this thread finally I am wondering whether this
> resp. other root cgroup exceptions have any follow up.

I'll address a feedback I've got from Johannes and will post and updated
version soon.

> 
> I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
> memcg: link page counters to root if use_hierarchy is false"). Thanks
> for taking care of that! I do agree that this approach is the most
> reasonable immediate fix.

Thanks!

> 
> Btw. we have been carrying a warning about non hierarchical hierarchies
> for years in our distribution kernels and haven't heard of any actual
> bug reports (except for LTP driven ones). So we might be really close to
> simply drop this functionality completely. This would simplify the code
> and prevent from future surprises.

Just sent an rfc patchset. Your feedback will be greatly appreciated!

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-10-23 16:30                 ` Johannes Weiner
@ 2020-11-10  1:27                   ` Roman Gushchin
  2020-11-10 15:11                     ` Shakeel Butt
  0 siblings, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2020-11-10  1:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Koutný,
	Shakeel Butt, Richard Palethorpe, ltp, Andrew Morton,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-kernel, Michal Hocko

On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > If we want these counter to function properly, then we should go into the opposite
> > > > direction and remove the special handling of the root memory cgroup in many places.
> > > 
> > > I suspect this is also by far the most robust solution from a code and
> > > maintenance POV.
> > > 
> > > I don't recall the page counter at the root level having been a
> > > concern in recent years, even though it's widely used in production
> > > environments. It's lockless and cache compact. It's also per-cpu
> > > batched, which means it isn't actually part of the memcg hotpath.
> > 
> > 
> > I agree.
> > 
> > Here is my first attempt. Comments are welcome!
> > 
> > It doesn't solve the original problem though (use_hierarchy == false and
> > objcg reparenting), I'll send a separate patch for that.
> > 
> > Thanks!
> > 
> > --
> > 
> > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <guro@fb.com>
> > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> >  specially
> > 
> > Currently the root memory cgroup is treated in a special way:
> > it's not charged and uncharged directly (only indirectly with their
> > descendants), processes belonging to the root memory cgroup are exempt
> > from the kernel- and the socket memory accounting.
> > 
> > At the same time some of root level statistics and data are available
> > to a user:
> >   - cgroup v2: memory.stat
> >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > 
> > Historically the reason for a special treatment was an avoidance
> > of extra performance cost, however now it's unlikely a good reason:
> > over years there was a significant improvement in the performance
> > of the memory cgroup code. Also on a modern system actively using
> > cgroups (e.g. managed by systemd) there are usually no (significant)
> > processes in the root memory cgroup.
> > 
> > The special treatment of the root memory cgroups creates a number of
> > issues visible to a user:
> > 1) slab stats on the root level do not include the slab memory
> >    consumed by processes in the root memory cgroup
> > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> >    is not included into memory.kmem.usage_in_bytes
> > 3) socket memory consumed by processes in the root memory cgroup
> >    is not included into memory.kmem.tcp.usage_in_bytes
> > 
> > It complicates the code and increases a risk of new bugs.
> > 
> > This patch removes a number of exceptions related to the handling of
> > the root memory cgroup. With this patch applied the root memory cgroup
> > is treated uniformly to other cgroups in the following cases:
> > 1) root memory cgroup is charged and uncharged directly, try_charge()
> >    and cancel_charge() do not return immediately if the root memory
> >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> >    do not handle the root memory cgroup specially.
> > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > 3) shrinkers infra treats the root memory cgroup as any other memory
> >    cgroup
> > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> >    by processes belonging to the root memory cgroup
> > 5) if a socket is opened by a process in the root memory cgroup,
> >    the socket memory is accounted
> > 6) root cgroup is charged for the used swap memory.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> This looks great.
> 
> The try_charge(), cancel_charge() etc. paths are relatively
> straight-forward and look correct to me.
> 
> The swap counters too.
> 
> Slab is a bit trickier, but it also looks correct to me.
> 
> I'm having some trouble with the shrinkers. Currently, tracked objects
> allocated in non-root cgroups live in that cgroup. Tracked objects in
> the root cgroup, as well as untracked objects, live in a global pool.
> When reclaim iterates all memcgs and calls shrink_slab(), we special
> case the root_mem_cgroup and redirect to the global pool.
> 
> After your patch we have tracked objects allocated in the root cgroup
> actually live in the root cgroup. Removing the shrinker special case
> is correct in order to shrink those - but it removes the call to
> shrink the global pool of untracked allocation classes.
> 
> I think we need to restore the double call to shrink_slab() we had
> prior to this:
> 
> commit aeed1d325d429ac9699c4bf62d17156d60905519
> Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> Date:   Fri Aug 17 15:48:17 2018 -0700
> 
>     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
>     
>     The patch makes shrink_slab() be called for root_mem_cgroup in the same
>     way as it's called for the rest of cgroups.  This simplifies the logic
>     and improves the readability.
> 
> where we iterate through all cgroups, including the root, to reclaim
> objects accounted to those respective groups; and then a call to scan
> the global pool of untracked objects in that numa node.

I agree, thank you for pointing at this commit.

> 
> For ease of review/verification, it could be helpful to split the
> patch and remove the root exception case-by-case (not callsite by
> callsite, but e.g. the swap counter, the memory counter etc.).

Sorry for a long pause, here's an update. I've split the patch,
fixed a couple of issues and was almost ready to send it upstream,
but then I've noticed that on cgroup v1 kmem and memsw counters
are sometimes heading into a negative territory and generating a warning
in dmesg. It happens for a short amount of time at early stages
of the system uptime. I haven't seen it happening with the memory counter.

My investigation showed that the reason is that the result of a
cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
early stages. Depending on the return value we charge or skip the kmem
counter and also handle the swap/memsw counter differently.

The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
can change at any particular moment. So I don't see how to make all root's
counters consistent without tracking them all no matter which cgroup version
is used. Which is obviously an overkill and will lead to an overhead, which
unlikely can be justified.

I'll appreciate any ideas, but I don't see a good path forward here
(except fixing a particular issue with root's slab stats with the
Muchun's patch).

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-11-10  1:27                   ` Roman Gushchin
@ 2020-11-10 15:11                     ` Shakeel Butt
  2020-11-10 19:13                       ` Roman Gushchin
  2020-11-20 17:46                       ` Michal Koutný
  0 siblings, 2 replies; 45+ messages in thread
From: Shakeel Butt @ 2020-11-10 15:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Koutný,
	Richard Palethorpe, LTP List, Andrew Morton, Christoph Lameter,
	Michal Hocko, Tejun Heo, Vlastimil Babka, Linux MM, LKML,
	Michal Hocko

On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > >
> > > > I suspect this is also by far the most robust solution from a code and
> > > > maintenance POV.
> > > >
> > > > I don't recall the page counter at the root level having been a
> > > > concern in recent years, even though it's widely used in production
> > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > batched, which means it isn't actually part of the memcg hotpath.
> > >
> > >
> > > I agree.
> > >
> > > Here is my first attempt. Comments are welcome!
> > >
> > > It doesn't solve the original problem though (use_hierarchy == false and
> > > objcg reparenting), I'll send a separate patch for that.
> > >
> > > Thanks!
> > >
> > > --
> > >
> > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <guro@fb.com>
> > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > >  specially
> > >
> > > Currently the root memory cgroup is treated in a special way:
> > > it's not charged and uncharged directly (only indirectly with their
> > > descendants), processes belonging to the root memory cgroup are exempt
> > > from the kernel- and the socket memory accounting.
> > >
> > > At the same time some of root level statistics and data are available
> > > to a user:
> > >   - cgroup v2: memory.stat
> > >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > >
> > > Historically the reason for a special treatment was an avoidance
> > > of extra performance cost, however now it's unlikely a good reason:
> > > over years there was a significant improvement in the performance
> > > of the memory cgroup code. Also on a modern system actively using
> > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > processes in the root memory cgroup.
> > >
> > > The special treatment of the root memory cgroups creates a number of
> > > issues visible to a user:
> > > 1) slab stats on the root level do not include the slab memory
> > >    consumed by processes in the root memory cgroup
> > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > >    is not included into memory.kmem.usage_in_bytes
> > > 3) socket memory consumed by processes in the root memory cgroup
> > >    is not included into memory.kmem.tcp.usage_in_bytes
> > >
> > > It complicates the code and increases a risk of new bugs.
> > >
> > > This patch removes a number of exceptions related to the handling of
> > > the root memory cgroup. With this patch applied the root memory cgroup
> > > is treated uniformly to other cgroups in the following cases:
> > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > >    and cancel_charge() do not return immediately if the root memory
> > >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > >    do not handle the root memory cgroup specially.
> > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > >    cgroup
> > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > >    by processes belonging to the root memory cgroup
> > > 5) if a socket is opened by a process in the root memory cgroup,
> > >    the socket memory is accounted
> > > 6) root cgroup is charged for the used swap memory.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > This looks great.
> >
> > The try_charge(), cancel_charge() etc. paths are relatively
> > straight-forward and look correct to me.
> >
> > The swap counters too.
> >
> > Slab is a bit trickier, but it also looks correct to me.
> >
> > I'm having some trouble with the shrinkers. Currently, tracked objects
> > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > the root cgroup, as well as untracked objects, live in a global pool.
> > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > case the root_mem_cgroup and redirect to the global pool.
> >
> > After your patch we have tracked objects allocated in the root cgroup
> > actually live in the root cgroup. Removing the shrinker special case
> > is correct in order to shrink those - but it removes the call to
> > shrink the global pool of untracked allocation classes.
> >
> > I think we need to restore the double call to shrink_slab() we had
> > prior to this:
> >
> > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Date:   Fri Aug 17 15:48:17 2018 -0700
> >
> >     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> >
> >     The patch makes shrink_slab() be called for root_mem_cgroup in the same
> >     way as it's called for the rest of cgroups.  This simplifies the logic
> >     and improves the readability.
> >
> > where we iterate through all cgroups, including the root, to reclaim
> > objects accounted to those respective groups; and then a call to scan
> > the global pool of untracked objects in that numa node.
>
> I agree, thank you for pointing at this commit.
>
> >
> > For ease of review/verification, it could be helpful to split the
> > patch and remove the root exception case-by-case (not callsite by
> > callsite, but e.g. the swap counter, the memory counter etc.).
>
> Sorry for a long pause, here's an update. I've split the patch,
> fixed a couple of issues and was almost ready to send it upstream,
> but then I've noticed that on cgroup v1 kmem and memsw counters
> are sometimes heading into a negative territory and generating a warning
> in dmesg. It happens for a short amount of time at early stages
> of the system uptime. I haven't seen it happening with the memory counter.
>
> My investigation showed that the reason is that the result of a
> cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> early stages. Depending on the return value we charge or skip the kmem
> counter and also handle the swap/memsw counter differently.
>
> The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> can change at any particular moment. So I don't see how to make all root's
> counters consistent without tracking them all no matter which cgroup version
> is used. Which is obviously an overkill and will lead to an overhead, which
> unlikely can be justified.
>
> I'll appreciate any ideas, but I don't see a good path forward here
> (except fixing a particular issue with root's slab stats with the
> Muchun's patch).
>

Since the commit 0158115f702b0 ("memcg, kmem: deprecate
kmem.limit_in_bytes"), we are in the process of deprecating the limit
on kmem. If we decide that now is the time to deprecate it, we can
convert the kmem page counter to a memcg stat, update it for both v1
and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
memcg stat is more efficient than the page counter, so I don't think
overhead should be an issue. This new memcg stat represents all types
of kmem memory for a memcg like slab, stack and no-type. What do you
think?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-11-10 15:11                     ` Shakeel Butt
@ 2020-11-10 19:13                       ` Roman Gushchin
  2020-11-20 17:46                       ` Michal Koutný
  1 sibling, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2020-11-10 19:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Koutný,
	Richard Palethorpe, LTP List, Andrew Morton, Christoph Lameter,
	Michal Hocko, Tejun Heo, Vlastimil Babka, Linux MM, LKML,
	Michal Hocko

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > > >
> > > > > I suspect this is also by far the most robust solution from a code and
> > > > > maintenance POV.
> > > > >
> > > > > I don't recall the page counter at the root level having been a
> > > > > concern in recent years, even though it's widely used in production
> > > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > > batched, which means it isn't actually part of the memcg hotpath.
> > > >
> > > >
> > > > I agree.
> > > >
> > > > Here is my first attempt. Comments are welcome!
> > > >
> > > > It doesn't solve the original problem though (use_hierarchy == false and
> > > > objcg reparenting), I'll send a separate patch for that.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > >
> > > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > > From: Roman Gushchin <guro@fb.com>
> > > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > > >  specially
> > > >
> > > > Currently the root memory cgroup is treated in a special way:
> > > > it's not charged and uncharged directly (only indirectly with their
> > > > descendants), processes belonging to the root memory cgroup are exempt
> > > > from the kernel- and the socket memory accounting.
> > > >
> > > > At the same time some of root level statistics and data are available
> > > > to a user:
> > > >   - cgroup v2: memory.stat
> > > >   - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > > >                memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > > >
> > > > Historically the reason for a special treatment was an avoidance
> > > > of extra performance cost, however now it's unlikely a good reason:
> > > > over years there was a significant improvement in the performance
> > > > of the memory cgroup code. Also on a modern system actively using
> > > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > > processes in the root memory cgroup.
> > > >
> > > > The special treatment of the root memory cgroups creates a number of
> > > > issues visible to a user:
> > > > 1) slab stats on the root level do not include the slab memory
> > > >    consumed by processes in the root memory cgroup
> > > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > > >    is not included into memory.kmem.usage_in_bytes
> > > > 3) socket memory consumed by processes in the root memory cgroup
> > > >    is not included into memory.kmem.tcp.usage_in_bytes
> > > >
> > > > It complicates the code and increases a risk of new bugs.
> > > >
> > > > This patch removes a number of exceptions related to the handling of
> > > > the root memory cgroup. With this patch applied the root memory cgroup
> > > > is treated uniformly to other cgroups in the following cases:
> > > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > >    and cancel_charge() do not return immediately if the root memory
> > > >    cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > >    do not handle the root memory cgroup specially.
> > > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > > >    cgroup
> > > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > > >    by processes belonging to the root memory cgroup
> > > > 5) if a socket is opened by a process in the root memory cgroup,
> > > >    the socket memory is accounted
> > > > 6) root cgroup is charged for the used swap memory.
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > This looks great.
> > >
> > > The try_charge(), cancel_charge() etc. paths are relatively
> > > straight-forward and look correct to me.
> > >
> > > The swap counters too.
> > >
> > > Slab is a bit trickier, but it also looks correct to me.
> > >
> > > I'm having some trouble with the shrinkers. Currently, tracked objects
> > > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > > the root cgroup, as well as untracked objects, live in a global pool.
> > > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > > case the root_mem_cgroup and redirect to the global pool.
> > >
> > > After your patch we have tracked objects allocated in the root cgroup
> > > actually live in the root cgroup. Removing the shrinker special case
> > > is correct in order to shrink those - but it removes the call to
> > > shrink the global pool of untracked allocation classes.
> > >
> > > I think we need to restore the double call to shrink_slab() we had
> > > prior to this:
> > >
> > > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > > Author: Vladimir Davydov <vdavydov.dev@gmail.com>
> > > Date:   Fri Aug 17 15:48:17 2018 -0700
> > >
> > >     mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> > >
> > >     The patch makes shrink_slab() be called for root_mem_cgroup in the same
> > >     way as it's called for the rest of cgroups.  This simplifies the logic
> > >     and improves the readability.
> > >
> > > where we iterate through all cgroups, including the root, to reclaim
> > > objects accounted to those respective groups; and then a call to scan
> > > the global pool of untracked objects in that numa node.
> >
> > I agree, thank you for pointing at this commit.
> >
> > >
> > > For ease of review/verification, it could be helpful to split the
> > > patch and remove the root exception case-by-case (not callsite by
> > > callsite, but e.g. the swap counter, the memory counter etc.).
> >
> > Sorry for a long pause, here's an update. I've split the patch,
> > fixed a couple of issues and was almost ready to send it upstream,
> > but then I've noticed that on cgroup v1 kmem and memsw counters
> > are sometimes heading into a negative territory and generating a warning
> > in dmesg. It happens for a short amount of time at early stages
> > of the system uptime. I haven't seen it happening with the memory counter.
> >
> > My investigation showed that the reason is that the result of a
> > cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> > early stages. Depending on the return value we charge or skip the kmem
> > counter and also handle the swap/memsw counter differently.
> >
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment. So I don't see how to make all root's
> > counters consistent without tracking them all no matter which cgroup version
> > is used. Which is obviously an overkill and will lead to an overhead, which
> > unlikely can be justified.
> >
> > I'll appreciate any ideas, but I don't see a good path forward here
> > (except fixing a particular issue with root's slab stats with the
> > Muchun's patch).
> >
> 
> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
> memcg stat is more efficient than the page counter, so I don't think
> overhead should be an issue. This new memcg stat represents all types
> of kmem memory for a memcg like slab, stack and no-type. What do you
> think?

Hm, I don't see why it can't work, but it will not solve the memsw problem,
right?

Correct handling of root's kmem and tcpmem counters was the main reason
to select "always charge the root memory cgroup" over "never charge the
root memory cgroup". If we'll not use these counters directly, than it's
probably better to never charge the root memory cgroup, at least from
the performance point of view.

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root
  2020-11-10 15:11                     ` Shakeel Butt
  2020-11-10 19:13                       ` Roman Gushchin
@ 2020-11-20 17:46                       ` Michal Koutný
  1 sibling, 0 replies; 45+ messages in thread
From: Michal Koutný @ 2020-11-20 17:46 UTC (permalink / raw)
  To: Shakeel Butt, Roman Gushchin
  Cc: Johannes Weiner, Richard Palethorpe, LTP List, Andrew Morton,
	Christoph Lameter, Michal Hocko, Tejun Heo, Vlastimil Babka,
	Linux MM, LKML, Michal Hocko


[-- Attachment #1: Type: text/plain, Size: 996 bytes --]

Hi.

On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely free'd.)

> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the
necessary counters in both directions using the statistics (or some
complementarity, without fine grained counters removal).

I didn't check all the charging/uncharging places, these are just my 2
cents to the issue.

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, back to index

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-14 19:07 [RFC PATCH] mm: memcg/slab: Stop reparented obj_cgroups from charging root Richard Palethorpe
2020-10-14 20:08 ` Roman Gushchin
2020-10-16  5:40   ` Richard Palethorpe
2020-10-16  9:47 ` Michal Koutný
2020-10-16 10:41   ` Richard Palethorpe
2020-10-16 15:05     ` Richard Palethorpe
2020-10-16 17:26       ` Michal Koutný
2020-10-16 14:53   ` Johannes Weiner
2020-10-16 17:02     ` Roman Gushchin
2020-10-16 17:15     ` Michal Koutný
2020-10-19  8:45       ` Richard Palethorpe
2020-10-19  9:58         ` [PATCH v3] " Richard Palethorpe
2020-10-19 16:58           ` Shakeel Butt
2020-10-20  5:52             ` Richard Palethorpe
2020-10-20 13:49               ` Richard Palethorpe
2020-10-20 16:56                 ` Shakeel Butt
2020-10-21 20:32                   ` Roman Gushchin
2020-10-20 17:24               ` Michal Koutný
2020-10-22  7:04                 ` Richard Palethorpe
2020-10-22 12:28                   ` [PATCH v4] " Richard Palethorpe
2020-10-22 16:37                     ` Shakeel Butt
2020-10-22 17:25                       ` Roman Gushchin
2020-10-22 23:59                         ` Shakeel Butt
2020-10-23  0:40                           ` Roman Gushchin
2020-10-23 15:44                             ` Johannes Weiner
2020-10-23 16:41                             ` Shakeel Butt
2020-10-26  7:32                             ` Richard Palethorpe
2020-10-26 23:14                               ` Roman Gushchin
2020-10-19 22:28       ` [RFC PATCH] " Roman Gushchin
2020-10-20  6:04         ` Richard Palethorpe
2020-10-20 12:02           ` Richard Palethorpe
2020-10-20 14:48         ` Richard Palethorpe
2020-10-20 16:27         ` Michal Koutný
2020-10-20 17:07           ` Roman Gushchin
2020-10-20 18:18             ` Johannes Weiner
2020-10-21 19:33               ` Roman Gushchin
2020-10-23 16:30                 ` Johannes Weiner
2020-11-10  1:27                   ` Roman Gushchin
2020-11-10 15:11                     ` Shakeel Butt
2020-11-10 19:13                       ` Roman Gushchin
2020-11-20 17:46                       ` Michal Koutný
2020-11-03 13:22                 ` Michal Hocko
2020-11-03 21:30                   ` Roman Gushchin
2020-10-20 16:55         ` Shakeel Butt
2020-10-20 17:17           ` Roman Gushchin

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git