linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
@ 2019-11-06 22:51 Roman Gushchin
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Roman Gushchin @ 2019-11-06 22:51 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Roman Gushchin, stable, Tejun Heo

We've encountered a rcu stall in get_mem_cgroup_from_mm():

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
 (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
 <...>
 RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
 <...>
 __memcg_kmem_charge+0x55/0x140
 __alloc_pages_nodemask+0x267/0x320
 pipe_write+0x1ad/0x400
 new_sync_write+0x127/0x1c0
 __kernel_write+0x4f/0xf0
 dump_emit+0x91/0xc0
 writenote+0xa0/0xc0
 elf_core_dump+0x11af/0x1430
 do_coredump+0xc65/0xee0
 ? unix_stream_sendmsg+0x37d/0x3b0
 get_signal+0x132/0x7c0
 do_signal+0x36/0x640
 ? recalc_sigpending+0x17/0x50
 exit_to_usermode_loop+0x61/0xd0
 do_syscall_64+0xd4/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is caused by an exiting task which is associated with
an offline memcg. We're iterating over and over in the
do {} while (!css_tryget_online()) loop, but obviously the memcg won't
become online and the exiting task won't be migrated to a live memcg.

Let's fix it by switching from css_tryget_online() to css_tryget().

As css_tryget_online() cannot guarantee that the memcg won't go
offline, the check is usually useless, except some rare cases
when for example it determines if something should be presented
to a user.

A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50f5bc55fcec..c5b5f74cfd4d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -939,7 +939,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 			if (unlikely(!memcg))
 				memcg = root_mem_cgroup;
 		}
-	} while (!css_tryget_online(&memcg->css));
+	} while (!css_tryget(&memcg->css));
 	rcu_read_unlock();
 	return memcg;
 }
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
  2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
@ 2019-11-06 22:51 ` Roman Gushchin
  2019-11-07  0:26   ` Johannes Weiner
                     ` (2 more replies)
  2019-11-07  0:22 ` [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Johannes Weiner
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Roman Gushchin @ 2019-11-06 22:51 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, kernel-team,
	Roman Gushchin, stable, Tejun Heo

An exiting task might belong to an offline cgroup. In this case
an attempt to grab a cgroup reference from the task can end up
with an infinite loop in hugetlb_cgroup_charge_cgroup(), because
neither the cgroup will become online, neither the task will
be migrated to a live cgroup.

Fix this by switching over to css_tryget(). As css_tryget_online()
can't guarantee that the cgroup won't go offline, in most cases
the check doesn't make sense. In this particular case users of
hugetlb_cgroup_charge_cgroup() are not affected by this change.

A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>
---
 mm/hugetlb_cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445..2ac38bdc18a1 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -196,7 +196,7 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 again:
 	rcu_read_lock();
 	h_cg = hugetlb_cgroup_from_task(current);
-	if (!css_tryget_online(&h_cg->css)) {
+	if (!css_tryget(&h_cg->css)) {
 		rcu_read_unlock();
 		goto again;
 	}
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
@ 2019-11-07  0:22 ` Johannes Weiner
  2019-11-07  1:25   ` Shakeel Butt
  2019-11-07 12:21 ` Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Johannes Weiner @ 2019-11-07  0:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, linux-kernel, kernel-team,
	stable, Tejun Heo

On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> We've encountered a rcu stall in get_mem_cgroup_from_mm():
> 
>  rcu: INFO: rcu_sched self-detected stall on CPU
>  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
>  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
>  <...>
>  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
>  <...>
>  __memcg_kmem_charge+0x55/0x140
>  __alloc_pages_nodemask+0x267/0x320
>  pipe_write+0x1ad/0x400
>  new_sync_write+0x127/0x1c0
>  __kernel_write+0x4f/0xf0
>  dump_emit+0x91/0xc0
>  writenote+0xa0/0xc0
>  elf_core_dump+0x11af/0x1430
>  do_coredump+0xc65/0xee0
>  ? unix_stream_sendmsg+0x37d/0x3b0
>  get_signal+0x132/0x7c0
>  do_signal+0x36/0x640
>  ? recalc_sigpending+0x17/0x50
>  exit_to_usermode_loop+0x61/0xd0
>  do_syscall_64+0xd4/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The problem is caused by an exiting task which is associated with
> an offline memcg. We're iterating over and over in the
> do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> become online and the exiting task won't be migrated to a live memcg.
> 
> Let's fix it by switching from css_tryget_online() to css_tryget().
> 
> As css_tryget_online() cannot guarantee that the memcg won't go
> offline, the check is usually useless, except some rare cases
> when for example it determines if something should be presented
> to a user.
> 
> A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> css_tryget() instead of css_tryget_online() in task_get_css()").
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

The bug aside, it doesn't matter whether the cgroup is online for the
callers. It used to matter when offlining needed to evacuate all
charges from the memcg, and so needed to prevent new ones from showing
up, but we don't care now.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
@ 2019-11-07  0:26   ` Johannes Weiner
  2019-11-07  2:31   ` Shakeel Butt
  2019-11-07 15:44   ` Tejun Heo
  2 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2019-11-07  0:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, linux-kernel, kernel-team,
	stable, Tejun Heo

On Wed, Nov 06, 2019 at 02:51:31PM -0800, Roman Gushchin wrote:
> An exiting task might belong to an offline cgroup. In this case
> an attempt to grab a cgroup reference from the task can end up
> with an infinite loop in hugetlb_cgroup_charge_cgroup(), because
> neither the cgroup will become online, neither the task will
> be migrated to a live cgroup.
> 
> Fix this by switching over to css_tryget(). As css_tryget_online()
> can't guarantee that the cgroup won't go offline, in most cases
> the check doesn't make sense. In this particular case users of
> hugetlb_cgroup_charge_cgroup() are not affected by this change.
> 
> A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> css_tryget() instead of css_tryget_online() in task_get_css()").
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07  0:22 ` [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Johannes Weiner
@ 2019-11-07  1:25   ` Shakeel Butt
  2019-11-07  1:43     ` Johannes Weiner
  2019-11-07  1:43     ` Roman Gushchin
  0 siblings, 2 replies; 34+ messages in thread
From: Shakeel Butt @ 2019-11-07  1:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Linux MM, Andrew Morton, Michal Hocko, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> >
> >  rcu: INFO: rcu_sched self-detected stall on CPU
> >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> >  <...>
> >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> >  <...>
> >  __memcg_kmem_charge+0x55/0x140
> >  __alloc_pages_nodemask+0x267/0x320
> >  pipe_write+0x1ad/0x400
> >  new_sync_write+0x127/0x1c0
> >  __kernel_write+0x4f/0xf0
> >  dump_emit+0x91/0xc0
> >  writenote+0xa0/0xc0
> >  elf_core_dump+0x11af/0x1430
> >  do_coredump+0xc65/0xee0
> >  ? unix_stream_sendmsg+0x37d/0x3b0
> >  get_signal+0x132/0x7c0
> >  do_signal+0x36/0x640
> >  ? recalc_sigpending+0x17/0x50
> >  exit_to_usermode_loop+0x61/0xd0
> >  do_syscall_64+0xd4/0x100
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > The problem is caused by an exiting task which is associated with
> > an offline memcg. We're iterating over and over in the
> > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > become online and the exiting task won't be migrated to a live memcg.
> >
> > Let's fix it by switching from css_tryget_online() to css_tryget().
> >
> > As css_tryget_online() cannot guarantee that the memcg won't go
> > offline, the check is usually useless, except some rare cases
> > when for example it determines if something should be presented
> > to a user.
> >
> > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > css_tryget() instead of css_tryget_online() in task_get_css()").
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: stable@vger.kernel.org
> > Cc: Tejun Heo <tj@kernel.org>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> The bug aside, it doesn't matter whether the cgroup is online for the
> callers. It used to matter when offlining needed to evacuate all
> charges from the memcg, and so needed to prevent new ones from showing
> up, but we don't care now.

Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
switched to css_tryget() as well then?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07  1:25   ` Shakeel Butt
@ 2019-11-07  1:43     ` Johannes Weiner
  2019-11-07  1:43     ` Roman Gushchin
  1 sibling, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2019-11-07  1:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Andrew Morton, Michal Hocko, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > >
> > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > >  <...>
> > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > >  <...>
> > >  __memcg_kmem_charge+0x55/0x140
> > >  __alloc_pages_nodemask+0x267/0x320
> > >  pipe_write+0x1ad/0x400
> > >  new_sync_write+0x127/0x1c0
> > >  __kernel_write+0x4f/0xf0
> > >  dump_emit+0x91/0xc0
> > >  writenote+0xa0/0xc0
> > >  elf_core_dump+0x11af/0x1430
> > >  do_coredump+0xc65/0xee0
> > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > >  get_signal+0x132/0x7c0
> > >  do_signal+0x36/0x640
> > >  ? recalc_sigpending+0x17/0x50
> > >  exit_to_usermode_loop+0x61/0xd0
> > >  do_syscall_64+0xd4/0x100
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg. We're iterating over and over in the
> > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > become online and the exiting task won't be migrated to a live memcg.
> > >
> > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > >
> > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > offline, the check is usually useless, except some rare cases
> > > when for example it determines if something should be presented
> > > to a user.
> > >
> > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: stable@vger.kernel.org
> > > Cc: Tejun Heo <tj@kernel.org>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > The bug aside, it doesn't matter whether the cgroup is online for the
> > callers. It used to matter when offlining needed to evacuate all
> > charges from the memcg, and so needed to prevent new ones from showing
> > up, but we don't care now.
> 
> Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> switched to css_tryget() as well then?

Yes. Looking at the remaining css_tryget_online() in memcontrol.c, I
don't think the online/offline distinction is meaningful for any of
them anymore.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07  1:25   ` Shakeel Butt
  2019-11-07  1:43     ` Johannes Weiner
@ 2019-11-07  1:43     ` Roman Gushchin
  2019-11-07  2:21       ` Shakeel Butt
  1 sibling, 1 reply; 34+ messages in thread
From: Roman Gushchin @ 2019-11-07  1:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Michal Hocko, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > >
> > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > >  <...>
> > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > >  <...>
> > >  __memcg_kmem_charge+0x55/0x140
> > >  __alloc_pages_nodemask+0x267/0x320
> > >  pipe_write+0x1ad/0x400
> > >  new_sync_write+0x127/0x1c0
> > >  __kernel_write+0x4f/0xf0
> > >  dump_emit+0x91/0xc0
> > >  writenote+0xa0/0xc0
> > >  elf_core_dump+0x11af/0x1430
> > >  do_coredump+0xc65/0xee0
> > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > >  get_signal+0x132/0x7c0
> > >  do_signal+0x36/0x640
> > >  ? recalc_sigpending+0x17/0x50
> > >  exit_to_usermode_loop+0x61/0xd0
> > >  do_syscall_64+0xd4/0x100
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg. We're iterating over and over in the
> > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > become online and the exiting task won't be migrated to a live memcg.
> > >
> > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > >
> > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > offline, the check is usually useless, except some rare cases
> > > when for example it determines if something should be presented
> > > to a user.
> > >
> > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: stable@vger.kernel.org
> > > Cc: Tejun Heo <tj@kernel.org>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > The bug aside, it doesn't matter whether the cgroup is online for the
> > callers. It used to matter when offlining needed to evacuate all
> > charges from the memcg, and so needed to prevent new ones from showing
> > up, but we don't care now.
> 
> Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> switched to css_tryget() as well then?

In those case it can't cause a rcu stall, so it's not a so urgent.
But you are right, we should probably do the same here. I'll look
at all remaining callers and prepare the patchset.

I'll also probably rename it to css_tryget_if_online() to make obvious
that it doesn't hold the cgroup from being onlined.

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07  1:43     ` Roman Gushchin
@ 2019-11-07  2:21       ` Shakeel Butt
  2019-11-07  2:28         ` Shakeel Butt
  0 siblings, 1 reply; 34+ messages in thread
From: Shakeel Butt @ 2019-11-07  2:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Michal Hocko, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 6, 2019 at 5:43 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> > On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > >
> > > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > >  <...>
> > > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > >  <...>
> > > >  __memcg_kmem_charge+0x55/0x140
> > > >  __alloc_pages_nodemask+0x267/0x320
> > > >  pipe_write+0x1ad/0x400
> > > >  new_sync_write+0x127/0x1c0
> > > >  __kernel_write+0x4f/0xf0
> > > >  dump_emit+0x91/0xc0
> > > >  writenote+0xa0/0xc0
> > > >  elf_core_dump+0x11af/0x1430
> > > >  do_coredump+0xc65/0xee0
> > > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > > >  get_signal+0x132/0x7c0
> > > >  do_signal+0x36/0x640
> > > >  ? recalc_sigpending+0x17/0x50
> > > >  exit_to_usermode_loop+0x61/0xd0
> > > >  do_syscall_64+0xd4/0x100
> > > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg. We're iterating over and over in the
> > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > become online and the exiting task won't be migrated to a live memcg.
> > > >
> > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > >
> > > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > > offline, the check is usually useless, except some rare cases
> > > > when for example it determines if something should be presented
> > > > to a user.
> > > >
> > > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > Cc: stable@vger.kernel.org
> > > > Cc: Tejun Heo <tj@kernel.org>
> > >
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > The bug aside, it doesn't matter whether the cgroup is online for the
> > > callers. It used to matter when offlining needed to evacuate all
> > > charges from the memcg, and so needed to prevent new ones from showing
> > > up, but we don't care now.
> >
> > Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> > switched to css_tryget() as well then?
>
> In those case it can't cause a rcu stall, so it's not a so urgent.
> But you are right, we should probably do the same here. I'll look
> at all remaining callers and prepare the patchset.
>
> I'll also probably rename it to css_tryget_if_online() to make obvious
> that it doesn't hold the cgroup from being onlined.
>

SGTM


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07  2:21       ` Shakeel Butt
@ 2019-11-07  2:28         ` Shakeel Butt
  0 siblings, 0 replies; 34+ messages in thread
From: Shakeel Butt @ 2019-11-07  2:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Michal Hocko, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 6, 2019 at 6:21 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Nov 6, 2019 at 5:43 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> > > On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > > >
> > > > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > > > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > > >  <...>
> > > > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > > >  <...>
> > > > >  __memcg_kmem_charge+0x55/0x140
> > > > >  __alloc_pages_nodemask+0x267/0x320
> > > > >  pipe_write+0x1ad/0x400
> > > > >  new_sync_write+0x127/0x1c0
> > > > >  __kernel_write+0x4f/0xf0
> > > > >  dump_emit+0x91/0xc0
> > > > >  writenote+0xa0/0xc0
> > > > >  elf_core_dump+0x11af/0x1430
> > > > >  do_coredump+0xc65/0xee0
> > > > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > > > >  get_signal+0x132/0x7c0
> > > > >  do_signal+0x36/0x640
> > > > >  ? recalc_sigpending+0x17/0x50
> > > > >  exit_to_usermode_loop+0x61/0xd0
> > > > >  do_syscall_64+0xd4/0x100
> > > > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > >
> > > > > The problem is caused by an exiting task which is associated with
> > > > > an offline memcg. We're iterating over and over in the
> > > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > > become online and the exiting task won't be migrated to a live memcg.
> > > > >
> > > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > > >
> > > > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > > > offline, the check is usually useless, except some rare cases
> > > > > when for example it determines if something should be presented
> > > > > to a user.
> > > > >
> > > > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > > > >
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>

Forgot to add:

Reviewed-by: Shakeel Butt <shakeeb@google.com>

> > > > > Cc: stable@vger.kernel.org
> > > > > Cc: Tejun Heo <tj@kernel.org>
> > > >
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >
> > > > The bug aside, it doesn't matter whether the cgroup is online for the
> > > > callers. It used to matter when offlining needed to evacuate all
> > > > charges from the memcg, and so needed to prevent new ones from showing
> > > > up, but we don't care now.
> > >
> > > Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> > > switched to css_tryget() as well then?
> >
> > In those case it can't cause a rcu stall, so it's not a so urgent.
> > But you are right, we should probably do the same here. I'll look
> > at all remaining callers and prepare the patchset.
> >
> > I'll also probably rename it to css_tryget_if_online() to make obvious
> > that it doesn't hold the cgroup from being onlined.
> >
>
> SGTM


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
  2019-11-07  0:26   ` Johannes Weiner
@ 2019-11-07  2:31   ` Shakeel Butt
  2019-11-07 15:44   ` Tejun Heo
  2 siblings, 0 replies; 34+ messages in thread
From: Shakeel Butt @ 2019-11-07  2:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Linux MM, Andrew Morton, Michal Hocko, Johannes Weiner, LKML,
	Kernel Team, stable, Tejun Heo

On Wed, Nov 6, 2019 at 2:53 PM Roman Gushchin <guro@fb.com> wrote:
>
> An exiting task might belong to an offline cgroup. In this case
> an attempt to grab a cgroup reference from the task can end up
> with an infinite loop in hugetlb_cgroup_charge_cgroup(), because
> neither the cgroup will become online, neither the task will
> be migrated to a live cgroup.
>
> Fix this by switching over to css_tryget(). As css_tryget_online()
> can't guarantee that the cgroup won't go offline, in most cases
> the check doesn't make sense. In this particular case users of
> hugetlb_cgroup_charge_cgroup() are not affected by this change.
>
> A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> css_tryget() instead of css_tryget_online() in task_get_css()").
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> Cc: stable@vger.kernel.org
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  mm/hugetlb_cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index f1930fa0b445..2ac38bdc18a1 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -196,7 +196,7 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
>  again:
>         rcu_read_lock();
>         h_cg = hugetlb_cgroup_from_task(current);
> -       if (!css_tryget_online(&h_cg->css)) {
> +       if (!css_tryget(&h_cg->css)) {
>                 rcu_read_unlock();
>                 goto again;
>         }
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
  2019-11-07  0:22 ` [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Johannes Weiner
@ 2019-11-07 12:21 ` Michal Hocko
  2019-11-07 15:43   ` Tejun Heo
  2019-11-07 16:42   ` Roman Gushchin
  2019-11-07 15:43 ` Tejun Heo
  2019-11-13 16:29 ` Michal Koutný
  4 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-07 12:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	kernel-team, stable, Tejun Heo

On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> We've encountered a rcu stall in get_mem_cgroup_from_mm():
> 
>  rcu: INFO: rcu_sched self-detected stall on CPU
>  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
>  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
>  <...>
>  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
>  <...>
>  __memcg_kmem_charge+0x55/0x140
>  __alloc_pages_nodemask+0x267/0x320
>  pipe_write+0x1ad/0x400
>  new_sync_write+0x127/0x1c0
>  __kernel_write+0x4f/0xf0
>  dump_emit+0x91/0xc0
>  writenote+0xa0/0xc0
>  elf_core_dump+0x11af/0x1430
>  do_coredump+0xc65/0xee0
>  ? unix_stream_sendmsg+0x37d/0x3b0
>  get_signal+0x132/0x7c0
>  do_signal+0x36/0x640
>  ? recalc_sigpending+0x17/0x50
>  exit_to_usermode_loop+0x61/0xd0
>  do_syscall_64+0xd4/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The problem is caused by an exiting task which is associated with
> an offline memcg.

Hmm, how can we have a task in an offline memcg? I thought that any
existing task will prevent cgroup removal from proceeding. Is this some
sort of race where the task managed to disassociate from the cgroup
while there is still a binding to a memcg existing? What am I missing?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07 12:21 ` Michal Hocko
@ 2019-11-07 15:43   ` Tejun Heo
  2019-11-07 16:42   ` Roman Gushchin
  1 sibling, 0 replies; 34+ messages in thread
From: Tejun Heo @ 2019-11-07 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Johannes Weiner,
	linux-kernel, kernel-team, stable

Hello,

On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> Hmm, how can we have a task in an offline memcg? I thought that any
> existing task will prevent cgroup removal from proceeding. Is this some
> sort of race where the task managed to disassociate from the cgroup
> while there is still a binding to a memcg existing? What am I missing?

The previous cgroup one was from bsd process accounting which happens
after the process is marked head.  This one IIUC is from core dumping
path.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
                   ` (2 preceding siblings ...)
  2019-11-07 12:21 ` Michal Hocko
@ 2019-11-07 15:43 ` Tejun Heo
  2019-11-13 16:29 ` Michal Koutný
  4 siblings, 0 replies; 34+ messages in thread
From: Tejun Heo @ 2019-11-07 15:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	linux-kernel, kernel-team, stable

On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> We've encountered a rcu stall in get_mem_cgroup_from_mm():
> 
>  rcu: INFO: rcu_sched self-detected stall on CPU
>  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
>  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
>  <...>
>  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
>  <...>
>  __memcg_kmem_charge+0x55/0x140
>  __alloc_pages_nodemask+0x267/0x320
>  pipe_write+0x1ad/0x400
>  new_sync_write+0x127/0x1c0
>  __kernel_write+0x4f/0xf0
>  dump_emit+0x91/0xc0
>  writenote+0xa0/0xc0
>  elf_core_dump+0x11af/0x1430
>  do_coredump+0xc65/0xee0
>  ? unix_stream_sendmsg+0x37d/0x3b0
>  get_signal+0x132/0x7c0
>  do_signal+0x36/0x640
>  ? recalc_sigpending+0x17/0x50
>  exit_to_usermode_loop+0x61/0xd0
>  do_syscall_64+0xd4/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The problem is caused by an exiting task which is associated with
> an offline memcg. We're iterating over and over in the
> do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> become online and the exiting task won't be migrated to a live memcg.
> 
> Let's fix it by switching from css_tryget_online() to css_tryget().
> 
> As css_tryget_online() cannot guarantee that the memcg won't go
> offline, the check is usually useless, except some rare cases
> when for example it determines if something should be presented
> to a user.
> 
> A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> css_tryget() instead of css_tryget_online() in task_get_css()").
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
  2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
  2019-11-07  0:26   ` Johannes Weiner
  2019-11-07  2:31   ` Shakeel Butt
@ 2019-11-07 15:44   ` Tejun Heo
  2 siblings, 0 replies; 34+ messages in thread
From: Tejun Heo @ 2019-11-07 15:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	linux-kernel, kernel-team, stable

On Wed, Nov 06, 2019 at 02:51:31PM -0800, Roman Gushchin wrote:
> An exiting task might belong to an offline cgroup. In this case
> an attempt to grab a cgroup reference from the task can end up
> with an infinite loop in hugetlb_cgroup_charge_cgroup(), because
> neither the cgroup will become online, neither the task will
> be migrated to a live cgroup.
> 
> Fix this by switching over to css_tryget(). As css_tryget_online()
> can't guarantee that the cgroup won't go offline, in most cases
> the check doesn't make sense. In this particular case users of
> hugetlb_cgroup_charge_cgroup() are not affected by this change.
> 
> A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> css_tryget() instead of css_tryget_online() in task_get_css()").
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07 12:21 ` Michal Hocko
  2019-11-07 15:43   ` Tejun Heo
@ 2019-11-07 16:42   ` Roman Gushchin
  2019-11-07 17:02     ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Roman Gushchin @ 2019-11-07 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > 
> >  rcu: INFO: rcu_sched self-detected stall on CPU
> >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> >  <...>
> >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> >  <...>
> >  __memcg_kmem_charge+0x55/0x140
> >  __alloc_pages_nodemask+0x267/0x320
> >  pipe_write+0x1ad/0x400
> >  new_sync_write+0x127/0x1c0
> >  __kernel_write+0x4f/0xf0
> >  dump_emit+0x91/0xc0
> >  writenote+0xa0/0xc0
> >  elf_core_dump+0x11af/0x1430
> >  do_coredump+0xc65/0xee0
> >  ? unix_stream_sendmsg+0x37d/0x3b0
> >  get_signal+0x132/0x7c0
> >  do_signal+0x36/0x640
> >  ? recalc_sigpending+0x17/0x50
> >  exit_to_usermode_loop+0x61/0xd0
> >  do_syscall_64+0xd4/0x100
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The problem is caused by an exiting task which is associated with
> > an offline memcg.
> 
> Hmm, how can we have a task in an offline memcg? I thought that any
> existing task will prevent cgroup removal from proceeding. Is this some
> sort of race where the task managed to disassociate from the cgroup
> while there is still a binding to a memcg existing? What am I missing?

It's an exiting task with the PF_EXITING flag set and it's in their late stages
of life.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07 16:42   ` Roman Gushchin
@ 2019-11-07 17:02     ` Michal Hocko
  2019-11-07 22:41       ` Roman Gushchin
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2019-11-07 17:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
> On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> > On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > 
> > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > >  <...>
> > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > >  <...>
> > >  __memcg_kmem_charge+0x55/0x140
> > >  __alloc_pages_nodemask+0x267/0x320
> > >  pipe_write+0x1ad/0x400
> > >  new_sync_write+0x127/0x1c0
> > >  __kernel_write+0x4f/0xf0
> > >  dump_emit+0x91/0xc0
> > >  writenote+0xa0/0xc0
> > >  elf_core_dump+0x11af/0x1430
> > >  do_coredump+0xc65/0xee0
> > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > >  get_signal+0x132/0x7c0
> > >  do_signal+0x36/0x640
> > >  ? recalc_sigpending+0x17/0x50
> > >  exit_to_usermode_loop+0x61/0xd0
> > >  do_syscall_64+0xd4/0x100
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg.
> > 
> > Hmm, how can we have a task in an offline memcg? I thought that any
> > existing task will prevent cgroup removal from proceeding. Is this some
> > sort of race where the task managed to disassociate from the cgroup
> > while there is still a binding to a memcg existing? What am I missing?
> 
> It's an exiting task with the PF_EXITING flag set and it's in their late stages
> of life.

This is a signal delivery path AFAIU (get_signal) and the coredumping
happens before do_exit. My understanding is that that unlinking
happens from cgroup_exit. So either I am misreading the backtrace or
there is some other way to leave cgroups or there is something more
going on.

JFTR I am not really disputing the patch but I simply do not understand
how the problem really happened.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07 17:02     ` Michal Hocko
@ 2019-11-07 22:41       ` Roman Gushchin
  2019-11-08  8:53         ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Roman Gushchin @ 2019-11-07 22:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Thu, Nov 07, 2019 at 06:02:00PM +0100, Michal Hocko wrote:
> On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
> > On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> > > On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > > 
> > > >  rcu: INFO: rcu_sched self-detected stall on CPU
> > > >  rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > >  (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > >  <...>
> > > >  RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > >  <...>
> > > >  __memcg_kmem_charge+0x55/0x140
> > > >  __alloc_pages_nodemask+0x267/0x320
> > > >  pipe_write+0x1ad/0x400
> > > >  new_sync_write+0x127/0x1c0
> > > >  __kernel_write+0x4f/0xf0
> > > >  dump_emit+0x91/0xc0
> > > >  writenote+0xa0/0xc0
> > > >  elf_core_dump+0x11af/0x1430
> > > >  do_coredump+0xc65/0xee0
> > > >  ? unix_stream_sendmsg+0x37d/0x3b0
> > > >  get_signal+0x132/0x7c0
> > > >  do_signal+0x36/0x640
> > > >  ? recalc_sigpending+0x17/0x50
> > > >  exit_to_usermode_loop+0x61/0xd0
> > > >  do_syscall_64+0xd4/0x100
> > > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg.
> > > 
> > > Hmm, how can we have a task in an offline memcg? I thought that any
> > > existing task will prevent cgroup removal from proceeding. Is this some
> > > sort of race where the task managed to disassociate from the cgroup
> > > while there is still a binding to a memcg existing? What am I missing?
> > 
> > It's an exiting task with the PF_EXITING flag set and it's in their late stages
> > of life.
> 
> This is a signal delivery path AFAIU (get_signal) and the coredumping
> happens before do_exit. My understanding is that that unlinking
> happens from cgroup_exit. So either I am misreading the backtrace or
> there is some other way to leave cgroups or there is something more
> going on.

Yeah, you're right. I have no better explanation for this and the similar,
mentioned in the commit bsd accounting issue, than some very rare race condition
that allows cgroups to be offlined with a task inside.

I'll think more about it.

Thanks, it's a really good question.

> 
> JFTR I am not really disputing the patch but I simply do not understand
> how the problem really happened.
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-07 22:41       ` Roman Gushchin
@ 2019-11-08  8:53         ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-08  8:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Thu 07-11-19 22:41:13, Roman Gushchin wrote:
> On Thu, Nov 07, 2019 at 06:02:00PM +0100, Michal Hocko wrote:
> > On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
[...]
> > > It's an exiting task with the PF_EXITING flag set and it's in their late stages
> > > of life.
> > 
> > This is a signal delivery path AFAIU (get_signal) and the coredumping
> > happens before do_exit. My understanding is that that unlinking
> > happens from cgroup_exit. So either I am misreading the backtrace or
> > there is some other way to leave cgroups or there is something more
> > going on.
> 
> Yeah, you're right. I have no better explanation for this and the similar,
> mentioned in the commit bsd accounting issue,

Tejun mentioned bsd accounting issue as well, but I do not see any
explicit reference to it in neither of the two patches.

> than some very rare race condition
> that allows cgroups to be offlined with a task inside.
> 
> I'll think more about it.

Thanks a lot. As I've said, I am not opposing this change once we have a
proper changelog but I find the explanation really weak. If there is a
race then it should be fixed as well.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
                   ` (3 preceding siblings ...)
  2019-11-07 15:43 ` Tejun Heo
@ 2019-11-13 16:29 ` Michal Koutný
  2019-11-13 17:08   ` Roman Gushchin
  4 siblings, 1 reply; 34+ messages in thread
From: Michal Koutný @ 2019-11-13 16:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	linux-kernel, kernel-team, stable, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

Hi.

On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <guro@fb.com> wrote:
> Let's fix it by switching from css_tryget_online() to css_tryget().
Is this a safe thing to do? The stack captures a kmem charge path, with
css_tryget() it may happen it gets an offlined memcg and carry out
charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
skipped as a consequence?

> The problem is caused by an exiting task which is associated with
> an offline memcg. We're iterating over and over in the
> do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> become online and the exiting task won't be migrated to a live memcg.
As discussed in other replies, the task is not yet exiting. However, the
access to memcg isn't through `current` but `mm->owner`, i.e. another
task of a threadgroup may have got stuck in an offlined memcg (I don't
have a good explanation for that though).

HTH,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-13 16:29 ` Michal Koutný
@ 2019-11-13 17:08   ` Roman Gushchin
  2019-11-14 19:16     ` Michal Hocko
  2019-11-21 15:28     ` Michal Koutný
  0 siblings, 2 replies; 34+ messages in thread
From: Roman Gushchin @ 2019-11-13 17:08 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	linux-kernel, Kernel Team, stable, Tejun Heo

On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutný wrote:
> Hi.
> 
> On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <guro@fb.com> wrote:
> > Let's fix it by switching from css_tryget_online() to css_tryget().
> Is this a safe thing to do? The stack captures a kmem charge path, with
> css_tryget() it may happen it gets an offlined memcg and carry out
> charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> skipped as a consequence?

The thing here is that css_tryget_online() cannot pin the online state,
so even if returned true, the cgroup can be offline at the return from
the function. So if we rely somewhere on it, it's already broken.

Generally speaking, it's better to reduce it's usage to the bare minimum.

> 
> > The problem is caused by an exiting task which is associated with
> > an offline memcg. We're iterating over and over in the
> > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > become online and the exiting task won't be migrated to a live memcg.
> As discussed in other replies, the task is not yet exiting. However, the
> access to memcg isn't through `current` but `mm->owner`, i.e. another
> task of a threadgroup may have got stuck in an offlined memcg (I don't
> have a good explanation for that though).

Yes, it's true, and I've no idea how the memcg can be offline in this case too.
We've seen it only several times in fb production, so it seems to be a really
rare case. Could be anything from a tiny race somewhere to cpu bugs.

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-13 17:08   ` Roman Gushchin
@ 2019-11-14 19:16     ` Michal Hocko
  2019-11-14 19:20       ` Tejun Heo
  2019-11-15 18:13       ` Roman Gushchin
  2019-11-21 15:28     ` Michal Koutný
  1 sibling, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-14 19:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Wed 13-11-19 17:08:29, Roman Gushchin wrote:
> On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutný wrote:
> > Hi.
> > 
> > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <guro@fb.com> wrote:
> > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > Is this a safe thing to do? The stack captures a kmem charge path, with
> > css_tryget() it may happen it gets an offlined memcg and carry out
> > charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> > skipped as a consequence?
> 
> The thing here is that css_tryget_online() cannot pin the online state,
> so even if returned true, the cgroup can be offline at the return from
> the function. So if we rely somewhere on it, it's already broken.

Then what is the point of this function and what about all other users?

> Generally speaking, it's better to reduce it's usage to the bare minimum.

If it doesn't have any sensible semantic then I would argue it should go
altogether otherwise we are going to chase new users again and aagain?
 
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg. We're iterating over and over in the
> > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > become online and the exiting task won't be migrated to a live memcg.
> > As discussed in other replies, the task is not yet exiting. However, the
> > access to memcg isn't through `current` but `mm->owner`, i.e. another
> > task of a threadgroup may have got stuck in an offlined memcg (I don't
> > have a good explanation for that though).

The trace however points to current->mm or current->active_memcg. Is it
possible that we have a stale active_memcg?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-14 19:16     ` Michal Hocko
@ 2019-11-14 19:20       ` Tejun Heo
  2019-11-14 19:33         ` Michal Hocko
  2019-11-15 18:13       ` Roman Gushchin
  1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2019-11-14 19:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

Hello,

On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> Then what is the point of this function and what about all other users?

It is useful for controlling admissions of new userspace visible uses
- e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
which has already been deleted.  We're just using it too liberally.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-14 19:20       ` Tejun Heo
@ 2019-11-14 19:33         ` Michal Hocko
  2019-11-14 19:37           ` Tejun Heo
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2019-11-14 19:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Thu 14-11-19 11:20:18, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> > Then what is the point of this function and what about all other users?
> 
> It is useful for controlling admissions of new userspace visible uses
> - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> which has already been deleted.

I am not sure I understand. Roman says that the cgroup can get offline
right after the function returns. How is "already deleted" different
from "just deleted"? I thought that the state is preserved at least
while the rcu lock is held but my memory is dim here.

> We're just using it too liberally.

Can we get a doc update to be explicit about sensible usecases so that
others can be dropped accordingly? 
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-14 19:33         ` Michal Hocko
@ 2019-11-14 19:37           ` Tejun Heo
  2019-11-15 17:40             ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2019-11-14 19:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

Hello,

On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > It is useful for controlling admissions of new userspace visible uses
> > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > which has already been deleted.
> 
> I am not sure I understand. Roman says that the cgroup can get offline
> right after the function returns. How is "already deleted" different
> from "just deleted"? I thought that the state is preserved at least
> while the rcu lock is held but my memory is dim here.

It's the same difference as between "opening a file and deleting it"
and "deleting a file and opening it".  We shoud allow the former while
not allowing the latter.

> > We're just using it too liberally.
> 
> Can we get a doc update to be explicit about sensible usecases so that
> others can be dropped accordingly? 

Yeah, we should audit and convert the uses and update the doc.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-14 19:37           ` Tejun Heo
@ 2019-11-15 17:40             ` Michal Hocko
  2019-11-15 17:45               ` Tejun Heo
  2019-11-15 17:47               ` Michal Hocko
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-15 17:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > It is useful for controlling admissions of new userspace visible uses
> > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > which has already been deleted.
> > 
> > I am not sure I understand. Roman says that the cgroup can get offline
> > right after the function returns. How is "already deleted" different
> > from "just deleted"? I thought that the state is preserved at least
> > while the rcu lock is held but my memory is dim here.
> 
> It's the same difference as between "opening a file and deleting it"
> and "deleting a file and opening it".

I am sorry but I do not follow. How can css_tryget_online provide the
same semantic when the css can go offline right after the tryget call
returns so it is effectivelly undistinguishable from the case when the
css was already online before the call was made. Or is my understanding
of what Roman's said earlier in the thread?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 17:40             ` Michal Hocko
@ 2019-11-15 17:45               ` Tejun Heo
  2019-11-15 17:47               ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Tejun Heo @ 2019-11-15 17:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri, Nov 15, 2019 at 06:40:31PM +0100, Michal Hocko wrote:
> I am sorry but I do not follow. How can css_tryget_online provide the
> same semantic when the css can go offline right after the tryget call
> returns so it is effectivelly undistinguishable from the case when the
> css was already online before the call was made. Or is my understanding
> of what Roman's said earlier in the thread?

It's *exactly* the same semantics as opening a file and deleting it
and keeping using it vs. deleting a file and then trying to open it.
You can't give out a new reference when the entity's visibility is
expected to be gone already and that's the exactly the condition that
tryget_online avoids.  I don't know how to better explain it if the
file analogy doesn't work.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 17:40             ` Michal Hocko
  2019-11-15 17:45               ` Tejun Heo
@ 2019-11-15 17:47               ` Michal Hocko
  2019-11-15 17:48                 ` Tejun Heo
  2019-11-15 18:07                 ` Roman Gushchin
  1 sibling, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-15 17:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > It is useful for controlling admissions of new userspace visible uses
> > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > which has already been deleted.
> > > 
> > > I am not sure I understand. Roman says that the cgroup can get offline
> > > right after the function returns. How is "already deleted" different
> > > from "just deleted"? I thought that the state is preserved at least
> > > while the rcu lock is held but my memory is dim here.
> > 
> > It's the same difference as between "opening a file and deleting it"
> > and "deleting a file and opening it".
> 
> I am sorry but I do not follow. How can css_tryget_online provide the
> same semantic when the css can go offline right after the tryget call
> returns so it is effectivelly undistinguishable from the case when the
> css was already online before the call was made.

s@online@offline@

And reading after myself it turned out to sound differently than I
meant. What I wanted to say really is, what is the difference that
css_tryget_online really guarantee when the css might go offline right
after the call suceeds so more specifically what is the difference
between
	if (css_tryget()) {
		if (online)
			DO_SOMETHING
	}
and
	if (css_tryget_online()) {
		DO_SOMETHING
	}

both of them are racy and do not provide any guarantee wrt. online
state.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 17:47               ` Michal Hocko
@ 2019-11-15 17:48                 ` Tejun Heo
  2019-11-15 18:03                   ` Michal Hocko
  2019-11-15 18:07                 ` Roman Gushchin
  1 sibling, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2019-11-15 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> s@online@offline@
> 
> And reading after myself it turned out to sound differently than I
> meant. What I wanted to say really is, what is the difference that
> css_tryget_online really guarantee when the css might go offline right
> after the call suceeds so more specifically what is the difference
> between
> 	if (css_tryget()) {
> 		if (online)
> 			DO_SOMETHING
> 	}
> and
> 	if (css_tryget_online()) {
> 		DO_SOMETHING
> 	}
> 
> both of them are racy and do not provide any guarantee wrt. online
> state.

It's about not giving new reference when the object is known to be
delted to the user.  Can you please think more about how file
deletions work?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 17:48                 ` Tejun Heo
@ 2019-11-15 18:03                   ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-15 18:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri 15-11-19 09:48:44, Tejun Heo wrote:
> On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> > s@online@offline@
> > 
> > And reading after myself it turned out to sound differently than I
> > meant. What I wanted to say really is, what is the difference that
> > css_tryget_online really guarantee when the css might go offline right
> > after the call suceeds so more specifically what is the difference
> > between
> > 	if (css_tryget()) {
> > 		if (online)
> > 			DO_SOMETHING
> > 	}
> > and
> > 	if (css_tryget_online()) {
> > 		DO_SOMETHING
> > 	}
> > 
> > both of them are racy and do not provide any guarantee wrt. online
> > state.
> 
> It's about not giving new reference when the object is known to be
> delted to the user.

This part is clear to me. The failure just says it is already too late
to do anything. I just still struggle why the success is telling me much
more when the state might change before I can do anything on the object.
I could see a usefulness if I've had a guarantee that the object stays
online while I hold a $FOO lock but if there is nothing like that then 
we are just having already too late or potentially too late.

Anyway it's been a hard week and the brain is just going for the weekend
so I just might be dense now.

> Can you please think more about how file deletions work?

I will try that with a fresh brain next week.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 17:47               ` Michal Hocko
  2019-11-15 17:48                 ` Tejun Heo
@ 2019-11-15 18:07                 ` Roman Gushchin
  2019-11-18  9:43                   ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Roman Gushchin @ 2019-11-15 18:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> > On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > > It is useful for controlling admissions of new userspace visible uses
> > > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > > which has already been deleted.
> > > > 
> > > > I am not sure I understand. Roman says that the cgroup can get offline
> > > > right after the function returns. How is "already deleted" different
> > > > from "just deleted"? I thought that the state is preserved at least
> > > > while the rcu lock is held but my memory is dim here.
> > > 
> > > It's the same difference as between "opening a file and deleting it"
> > > and "deleting a file and opening it".
> > 
> > I am sorry but I do not follow. How can css_tryget_online provide the
> > same semantic when the css can go offline right after the tryget call
> > returns so it is effectivelly undistinguishable from the case when the
> > css was already online before the call was made.
> 
> s@online@offline@
> 
> And reading after myself it turned out to sound differently than I
> meant. What I wanted to say really is, what is the difference that
> css_tryget_online really guarantee when the css might go offline right
> after the call suceeds so more specifically what is the difference
> between
> 	if (css_tryget()) {
> 		if (online)
> 			DO_SOMETHING
> 	}
> and
> 	if (css_tryget_online()) {
> 		DO_SOMETHING
> 	}
> 
> both of them are racy and do not provide any guarantee wrt. online
> state.

Let me step back a little bit.

I think, we all agree that css_tryget_online() has a weird semantics,
in most cases is used only due to historical reasons and clearly asks
for a cleanup. So I suggest to stop arguing about it and wait for the
cleanup patchset. Then we can discuss each remaining use case in details,
if there will be any.

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-14 19:16     ` Michal Hocko
  2019-11-14 19:20       ` Tejun Heo
@ 2019-11-15 18:13       ` Roman Gushchin
  1 sibling, 0 replies; 34+ messages in thread
From: Roman Gushchin @ 2019-11-15 18:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable, Tejun Heo

On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> On Wed 13-11-19 17:08:29, Roman Gushchin wrote:
> > On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutný wrote:
> > > Hi.
> > > 
> > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <guro@fb.com> wrote:
> > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > Is this a safe thing to do? The stack captures a kmem charge path, with
> > > css_tryget() it may happen it gets an offlined memcg and carry out
> > > charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> > > skipped as a consequence?
> > 
> > The thing here is that css_tryget_online() cannot pin the online state,
> > so even if returned true, the cgroup can be offline at the return from
> > the function. So if we rely somewhere on it, it's already broken.
> 
> Then what is the point of this function and what about all other users?
> 
> > Generally speaking, it's better to reduce it's usage to the bare minimum.
> 
> If it doesn't have any sensible semantic then I would argue it should go
> altogether otherwise we are going to chase new users again and aagain?

That's the plan: to audit all use cases and get rid of it where it's possible.

>  
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg. We're iterating over and over in the
> > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > become online and the exiting task won't be migrated to a live memcg.
> > > As discussed in other replies, the task is not yet exiting. However, the
> > > access to memcg isn't through `current` but `mm->owner`, i.e. another
> > > task of a threadgroup may have got stuck in an offlined memcg (I don't
> > > have a good explanation for that though).
> 
> The trace however points to current->mm or current->active_memcg. Is it
> possible that we have a stale active_memcg?

It wouldn't cause a rcu stall.

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-15 18:07                 ` Roman Gushchin
@ 2019-11-18  9:43                   ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-18  9:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Michal Koutný,
	linux-mm, Andrew Morton, Johannes Weiner, linux-kernel,
	Kernel Team, stable

On Fri 15-11-19 18:07:34, Roman Gushchin wrote:
> On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> > On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> > > On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > > > It is useful for controlling admissions of new userspace visible uses
> > > > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > > > which has already been deleted.
> > > > > 
> > > > > I am not sure I understand. Roman says that the cgroup can get offline
> > > > > right after the function returns. How is "already deleted" different
> > > > > from "just deleted"? I thought that the state is preserved at least
> > > > > while the rcu lock is held but my memory is dim here.
> > > > 
> > > > It's the same difference as between "opening a file and deleting it"
> > > > and "deleting a file and opening it".
> > > 
> > > I am sorry but I do not follow. How can css_tryget_online provide the
> > > same semantic when the css can go offline right after the tryget call
> > > returns so it is effectivelly undistinguishable from the case when the
> > > css was already online before the call was made.
> > 
> > s@online@offline@
> > 
> > And reading after myself it turned out to sound differently than I
> > meant. What I wanted to say really is, what is the difference that
> > css_tryget_online really guarantee when the css might go offline right
> > after the call suceeds so more specifically what is the difference
> > between
> > 	if (css_tryget()) {
> > 		if (online)
> > 			DO_SOMETHING
> > 	}
> > and
> > 	if (css_tryget_online()) {
> > 		DO_SOMETHING
> > 	}
> > 
> > both of them are racy and do not provide any guarantee wrt. online
> > state.
> 
> Let me step back a little bit.
> 
> I think, we all agree that css_tryget_online() has a weird semantics,
> in most cases is used only due to historical reasons and clearly asks
> for a cleanup. So I suggest to stop arguing about it and wait for the
> cleanup patchset. Then we can discuss each remaining use case in details,
> if there will be any.

Yes I am all in favor of the clean up patches as well as getting down
to the bottom of the underlying issue (race). Andrew has already sent
these two patches to Linus, unfortunatelly, even though the changelog
is slightly misleading (btw 18fa84a2db0e has the similar incorrect
reasoning).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-13 17:08   ` Roman Gushchin
  2019-11-14 19:16     ` Michal Hocko
@ 2019-11-21 15:28     ` Michal Koutný
  2019-11-22  8:20       ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Koutný @ 2019-11-21 15:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	linux-kernel, Kernel Team, stable, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

On Wed, Nov 13, 2019 at 05:08:29PM +0000, Roman Gushchin <guro@fb.com> wrote:
> The thing here is that css_tryget_online() cannot pin the online state,
> so even if returned true, the cgroup can be offline at the return from
> the function.
Adding this for the record.

The actual offlining happens only from css_killed_ref_fn, which is
percpu_ref_kill_and_confirm confirmation callback. That is only called
after RCU grace period. So, css_tryget_online will pin the online state
_inside_ rcu_read_{,un}lock section.

The fact is that many call sites pass the css reference over
rcu_read_unlock boundary allowing potential offlining.

> So if we rely somewhere on it, it's already broken.
Charges into offlined memcg should fine (wrt the original patch). I
didn't check other callers though...

> Generally speaking, it's better to reduce it's usage to the bare minimum.
...I agree as it's confusing and potentially redundant.

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
  2019-11-21 15:28     ` Michal Koutný
@ 2019-11-22  8:20       ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2019-11-22  8:20 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, linux-mm, Andrew Morton, Johannes Weiner,
	linux-kernel, Kernel Team, stable, Tejun Heo

On Thu 21-11-19 16:28:47, Michal Koutny wrote:
> On Wed, Nov 13, 2019 at 05:08:29PM +0000, Roman Gushchin <guro@fb.com> wrote:
> > The thing here is that css_tryget_online() cannot pin the online state,
> > so even if returned true, the cgroup can be offline at the return from
> > the function.
> Adding this for the record.
> 
> The actual offlining happens only from css_killed_ref_fn, which is
> percpu_ref_kill_and_confirm confirmation callback. That is only called
> after RCU grace period. So, css_tryget_online will pin the online state
> _inside_ rcu_read_{,un}lock section.

Thanks for the confirmation. Maybe we should start with a trivial patch
and document this first. We can go and cleanup bogus users on top.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-11-22  8:20 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06 22:51 [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Roman Gushchin
2019-11-06 22:51 ` [PATCH 2/2] mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() Roman Gushchin
2019-11-07  0:26   ` Johannes Weiner
2019-11-07  2:31   ` Shakeel Butt
2019-11-07 15:44   ` Tejun Heo
2019-11-07  0:22 ` [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() Johannes Weiner
2019-11-07  1:25   ` Shakeel Butt
2019-11-07  1:43     ` Johannes Weiner
2019-11-07  1:43     ` Roman Gushchin
2019-11-07  2:21       ` Shakeel Butt
2019-11-07  2:28         ` Shakeel Butt
2019-11-07 12:21 ` Michal Hocko
2019-11-07 15:43   ` Tejun Heo
2019-11-07 16:42   ` Roman Gushchin
2019-11-07 17:02     ` Michal Hocko
2019-11-07 22:41       ` Roman Gushchin
2019-11-08  8:53         ` Michal Hocko
2019-11-07 15:43 ` Tejun Heo
2019-11-13 16:29 ` Michal Koutný
2019-11-13 17:08   ` Roman Gushchin
2019-11-14 19:16     ` Michal Hocko
2019-11-14 19:20       ` Tejun Heo
2019-11-14 19:33         ` Michal Hocko
2019-11-14 19:37           ` Tejun Heo
2019-11-15 17:40             ` Michal Hocko
2019-11-15 17:45               ` Tejun Heo
2019-11-15 17:47               ` Michal Hocko
2019-11-15 17:48                 ` Tejun Heo
2019-11-15 18:03                   ` Michal Hocko
2019-11-15 18:07                 ` Roman Gushchin
2019-11-18  9:43                   ` Michal Hocko
2019-11-15 18:13       ` Roman Gushchin
2019-11-21 15:28     ` Michal Koutný
2019-11-22  8:20       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).