linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/2] mm: memcg: 3.13 fixes
@ 2013-12-04 22:45 Johannes Weiner
  2013-12-04 22:45 ` [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin Johannes Weiner
  2013-12-04 22:45 ` [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit Johannes Weiner
  0 siblings, 2 replies; 6+ messages in thread
From: Johannes Weiner @ 2013-12-04 22:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Rientjes, linux-mm, cgroups, linux-kernel

Hi,

here are two memcg fixes for 3.13.

The race condition in #1 is really long standing AFAICS, I just tagged
stable and will backport and evaluate this for any tree that sends me
a notice.

#2 changes what happens during a charge attempt between a task
entering memcg OOM and actually executing the kill.  I had these
charges bypass the limit in the hope that this would expedite the
kill, but there is no real evidence for it and David was worried about
an unecessary breach of isolation.  This was introduced in 3.12.

 mm/memcontrol.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin
  2013-12-04 22:45 [patch 0/2] mm: memcg: 3.13 fixes Johannes Weiner
@ 2013-12-04 22:45 ` Johannes Weiner
  2013-12-05 13:13   ` Michal Hocko
  2013-12-04 22:45 ` [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit Johannes Weiner
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2013-12-04 22:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Rientjes, linux-mm, cgroups, linux-kernel

There is a race condition between a memcg being torn down and a swapin
triggered from a different memcg of a page that was recorded to belong
to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).
The result is unreclaimable pages pointing to dead memcgs, which can
lead to anything from endless loops in later memcg teardown (the page
is charged to all hierarchical parents but is not on any LRU list) or
crashes from following the dangling memcg pointer.

Memcgs with tasks in them can not be torn down and usually charges
don't show up in memcgs without tasks.  Swapin with the
CONFIG_MEMCG_SWAP extension is the notable exception because it
charges the cgroup that was recorded as owner during swapout, which
may be empty and in the process of being torn down when a task in
another memcg triggers the swapin:

  teardown:                 swapin:

                            lookup_swap_cgroup_id()
                            rcu_read_lock()
                            mem_cgroup_lookup()
                            css_tryget()
                            rcu_read_unlock()
  disable css_tryget()
  call_rcu()
    offline_css()
      reparent_charges()
                            res_counter_charge() (hierarchical!)
                            css_put()
                              css_free()
                            pc->mem_cgroup = dead memcg
                            add page to dead lru

Add a final reparenting step into css_free() to make sure any such
raced charges are moved out of the memcg before it's finally freed.

In the longer term it would be cleaner to have the css_tryget() and
the res_counter charge under the same RCU lock section so that the
charge reparenting is deferred until the last charge whose tryget
succeeded is visible.  But this will require more invasive changes
that will be harder to evaluate and backport into stable, so better
defer them to a separate change set.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
---
 mm/memcontrol.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3aff0175d4c..f6a63f5b3827 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,42 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	/*
+	 * XXX: css_offline() would be where we should reparent all
+	 * memory to prepare the cgroup for destruction.  However,
+	 * memcg does not do css_tryget() and res_counter charging
+	 * under the same RCU lock region, which means that charging
+	 * could race with offlining.  Offlining only happens to
+	 * cgroups with no tasks in them but charges can show up
+	 * without any tasks from the swapin path when the target
+	 * memcg is looked up from the swapout record and not from the
+	 * current task as it usually is.  A race like this can leak
+	 * charges and put pages with stale cgroup pointers into
+	 * circulation:
+	 *
+	 * #0                        #1
+	 *                           lookup_swap_cgroup_id()
+	 *                           rcu_read_lock()
+	 *                           mem_cgroup_lookup()
+	 *                           css_tryget()
+	 *                           rcu_read_unlock()
+	 * disable css_tryget()
+	 * call_rcu()
+	 *   offline_css()
+	 *     reparent_charges()
+	 *                           res_counter_charge()
+	 *                           css_put()
+	 *                             css_free()
+	 *                           pc->mem_cgroup = dead memcg
+	 *                           add page to lru
+	 *
+	 * The bulk of the charges are still moved in offline_css() to
+	 * avoid pinning a lot of pages in case a long-term reference
+	 * like a swapout record is deferring the css_free() to long
+	 * after offlining.  But this makes sure we catch any charges
+	 * made after offlining:
+	 */
+	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit
  2013-12-04 22:45 [patch 0/2] mm: memcg: 3.13 fixes Johannes Weiner
  2013-12-04 22:45 ` [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin Johannes Weiner
@ 2013-12-04 22:45 ` Johannes Weiner
  2013-12-04 23:51   ` David Rientjes
  2013-12-05 13:14   ` Michal Hocko
  1 sibling, 2 replies; 6+ messages in thread
From: Johannes Weiner @ 2013-12-04 22:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Rientjes, linux-mm, cgroups, linux-kernel

4942642080ea ("mm: memcg: handle non-error OOM situations more
gracefully") allowed tasks that already entered a memcg OOM condition
to bypass the memcg limit on subsequent allocation attempts hoping
this would expedite finishing the page fault and executing the kill.

David Rientjes is worried that this breaks memcg isolation guarantees
and since there is no evidence that the bypass actually speeds up
fault processing just change it so that these subsequent charge
attempts fail outright.  The notable exception being __GFP_NOFAIL
charges which are required to bypass the limit regardless.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6a63f5b3827..bf5e89457149 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2694,7 +2694,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		goto bypass;
 
 	if (unlikely(task_in_memcg_oom(current)))
-		goto bypass;
+		goto nomem;
 
 	if (gfp_mask & __GFP_NOFAIL)
 		oom = false;
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit
  2013-12-04 22:45 ` [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit Johannes Weiner
@ 2013-12-04 23:51   ` David Rientjes
  2013-12-05 13:14   ` Michal Hocko
  1 sibling, 0 replies; 6+ messages in thread
From: David Rientjes @ 2013-12-04 23:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel

On Wed, 4 Dec 2013, Johannes Weiner wrote:

> 4942642080ea ("mm: memcg: handle non-error OOM situations more
> gracefully") allowed tasks that already entered a memcg OOM condition
> to bypass the memcg limit on subsequent allocation attempts hoping
> this would expedite finishing the page fault and executing the kill.
> 
> David Rientjes is worried that this breaks memcg isolation guarantees
> and since there is no evidence that the bypass actually speeds up
> fault processing just change it so that these subsequent charge
> attempts fail outright.  The notable exception being __GFP_NOFAIL
> charges which are required to bypass the limit regardless.
> 
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David Rientjes <rientjes@google.com>

Thanks!

I think we should consider marking this for stable@vger.kernel.org for 
3.12 since the original patch went into 3.12-rc6.  Depending on the number 
of allocators in the oom memcg, the amount of memory bypassed can become 
quite large.

For example, in a memcg with a limit of 128MB, if you start 10 concurrent 
processes that simply allocate a lot of memory you can get quite a bit of 
memory bypassed.  If I start 10 membench processes, which would cause a 
128MB memcg to oom even if only one such process were running, we get:

# grep RSS /proc/1092[0-9]/status
VmRSS:	   15724 kB
VmRSS:	   15064 kB
VmRSS:	   13224 kB
VmRSS:	   14520 kB
VmRSS:	   14472 kB
VmRSS:	   13016 kB
VmRSS:	   13024 kB
VmRSS:	   14560 kB
VmRSS:	   14864 kB
VmRSS:	   14772 kB

And all of those total ~140MB of memory while bound to a memcg with a 
128MB limit, about 10% of memory is bypassed.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin
  2013-12-04 22:45 ` [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin Johannes Weiner
@ 2013-12-05 13:13   ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2013-12-05 13:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Rientjes, linux-mm, cgroups, linux-kernel

On Wed 04-12-13 17:45:13, Johannes Weiner wrote:
> There is a race condition between a memcg being torn down and a swapin
> triggered from a different memcg of a page that was recorded to belong
> to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).
> The result is unreclaimable pages pointing to dead memcgs, which can
> lead to anything from endless loops in later memcg teardown (the page
> is charged to all hierarchical parents but is not on any LRU list) or
> crashes from following the dangling memcg pointer.
> 
> Memcgs with tasks in them can not be torn down and usually charges
> don't show up in memcgs without tasks.  Swapin with the
> CONFIG_MEMCG_SWAP extension is the notable exception because it
> charges the cgroup that was recorded as owner during swapout, which
> may be empty and in the process of being torn down when a task in
> another memcg triggers the swapin:
> 
>   teardown:                 swapin:
> 
>                             lookup_swap_cgroup_id()
>                             rcu_read_lock()
>                             mem_cgroup_lookup()
>                             css_tryget()
>                             rcu_read_unlock()
>   disable css_tryget()
>   call_rcu()
>     offline_css()
>       reparent_charges()
>                             res_counter_charge() (hierarchical!)
>                             css_put()
>                               css_free()
>                             pc->mem_cgroup = dead memcg
>                             add page to dead lru
> 
> Add a final reparenting step into css_free() to make sure any such
> raced charges are moved out of the memcg before it's finally freed.
> 
> In the longer term it would be cleaner to have the css_tryget() and
> the res_counter charge under the same RCU lock section so that the
> charge reparenting is deferred until the last charge whose tryget
> succeeded is visible.  But this will require more invasive changes
> that will be harder to evaluate and backport into stable, so better
> defer them to a separate change set.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org>

I have a patch to fix this properly but I didn't get to test it yet. I
will post it sometimes soon hopefully.

For the meantime
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e3aff0175d4c..f6a63f5b3827 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,42 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	/*
> +	 * XXX: css_offline() would be where we should reparent all
> +	 * memory to prepare the cgroup for destruction.  However,
> +	 * memcg does not do css_tryget() and res_counter charging
> +	 * under the same RCU lock region, which means that charging
> +	 * could race with offlining.  Offlining only happens to
> +	 * cgroups with no tasks in them but charges can show up
> +	 * without any tasks from the swapin path when the target
> +	 * memcg is looked up from the swapout record and not from the
> +	 * current task as it usually is.  A race like this can leak
> +	 * charges and put pages with stale cgroup pointers into
> +	 * circulation:
> +	 *
> +	 * #0                        #1
> +	 *                           lookup_swap_cgroup_id()
> +	 *                           rcu_read_lock()
> +	 *                           mem_cgroup_lookup()
> +	 *                           css_tryget()
> +	 *                           rcu_read_unlock()
> +	 * disable css_tryget()
> +	 * call_rcu()
> +	 *   offline_css()
> +	 *     reparent_charges()
> +	 *                           res_counter_charge()
> +	 *                           css_put()
> +	 *                             css_free()
> +	 *                           pc->mem_cgroup = dead memcg
> +	 *                           add page to lru
> +	 *
> +	 * The bulk of the charges are still moved in offline_css() to
> +	 * avoid pinning a lot of pages in case a long-term reference
> +	 * like a swapout record is deferring the css_free() to long
> +	 * after offlining.  But this makes sure we catch any charges
> +	 * made after offlining:
> +	 */
> +	mem_cgroup_reparent_charges(memcg);
>  
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
> -- 
> 1.8.4.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit
  2013-12-04 22:45 ` [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit Johannes Weiner
  2013-12-04 23:51   ` David Rientjes
@ 2013-12-05 13:14   ` Michal Hocko
  1 sibling, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2013-12-05 13:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Rientjes, linux-mm, cgroups, linux-kernel

On Wed 04-12-13 17:45:14, Johannes Weiner wrote:
> 4942642080ea ("mm: memcg: handle non-error OOM situations more
> gracefully") allowed tasks that already entered a memcg OOM condition
> to bypass the memcg limit on subsequent allocation attempts hoping
> this would expedite finishing the page fault and executing the kill.
> 
> David Rientjes is worried that this breaks memcg isolation guarantees
> and since there is no evidence that the bypass actually speeds up
> fault processing just change it so that these subsequent charge
> attempts fail outright.  The notable exception being __GFP_NOFAIL
> charges which are required to bypass the limit regardless.
> 
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

We want this in stable as well IMHO.
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f6a63f5b3827..bf5e89457149 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2694,7 +2694,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
> -		goto bypass;
> +		goto nomem;
>  
>  	if (gfp_mask & __GFP_NOFAIL)
>  		oom = false;
> -- 
> 1.8.4.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-12-05 13:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-04 22:45 [patch 0/2] mm: memcg: 3.13 fixes Johannes Weiner
2013-12-04 22:45 ` [patch 1/2] mm: memcg: fix race condition between memcg teardown and swapin Johannes Weiner
2013-12-05 13:13   ` Michal Hocko
2013-12-04 22:45 ` [patch 2/2] mm: memcg: do not allow task about to OOM kill to bypass the limit Johannes Weiner
2013-12-04 23:51   ` David Rientjes
2013-12-05 13:14   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).