linux-kernel.vger.kernel.org archive mirror
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
       [not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
@ 2013-11-27 23:08 ` David Rientjes
  2013-11-27 23:33   ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2013-11-27 23:08 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: stable, Michal Hocko, azurit, mm-commits, linux-kernel, linux-mm

On Thu, 17 Oct 2013, akpm@linux-foundation.org wrote:

> diff -puN mm/memcontrol.c~mm-memcg-handle-non-error-oom-situations-more-gracefully mm/memcontrol.c
> --- a/mm/memcontrol.c~mm-memcg-handle-non-error-oom-situations-more-gracefully
> +++ a/mm/memcontrol.c
> @@ -2161,110 +2161,59 @@ static void memcg_oom_recover(struct mem
>  		memcg_wakeup_oom(memcg);
>  }
>  
> -/*
> - * try to call OOM killer
> - */
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	bool locked;
> -	int wakeups;
> -
>  	if (!current->memcg_oom.may_oom)
>  		return;
> -
> -	current->memcg_oom.in_memcg_oom = 1;
> -
>  	/*
> -	 * As with any blocking lock, a contender needs to start
> -	 * listening for wakeups before attempting the trylock,
> -	 * otherwise it can miss the wakeup from the unlock and sleep
> -	 * indefinitely.  This is just open-coded because our locking
> -	 * is so particular to memcg hierarchies.
> +	 * We are in the middle of the charge context here, so we
> +	 * don't want to block when potentially sitting on a callstack
> +	 * that holds all kinds of filesystem and mm locks.
> +	 *
> +	 * Also, the caller may handle a failed allocation gracefully
> +	 * (like optional page cache readahead) and so an OOM killer
> +	 * invocation might not even be necessary.
> +	 *
> +	 * That's why we don't do anything here except remember the
> +	 * OOM context and then deal with it at the end of the page
> +	 * fault when the stack is unwound, the locks are released,
> +	 * and when we know whether the fault was overall successful.
>  	 */
> -	wakeups = atomic_read(&memcg->oom_wakeups);
> -	mem_cgroup_mark_under_oom(memcg);
> -
> -	locked = mem_cgroup_oom_trylock(memcg);
> -
> -	if (locked)
> -		mem_cgroup_oom_notify(memcg);
> -
> -	if (locked && !memcg->oom_kill_disable) {
> -		mem_cgroup_unmark_under_oom(memcg);
> -		mem_cgroup_out_of_memory(memcg, mask, order);
> -		mem_cgroup_oom_unlock(memcg);
> -		/*
> -		 * There is no guarantee that an OOM-lock contender
> -		 * sees the wakeups triggered by the OOM kill
> -		 * uncharges.  Wake any sleepers explicitely.
> -		 */
> -		memcg_oom_recover(memcg);
> -	} else {
> -		/*
> -		 * A system call can just return -ENOMEM, but if this
> -		 * is a page fault and somebody else is handling the
> -		 * OOM already, we need to sleep on the OOM waitqueue
> -		 * for this memcg until the situation is resolved.
> -		 * Which can take some time because it might be
> -		 * handled by a userspace task.
> -		 *
> -		 * However, this is the charge context, which means
> -		 * that we may sit on a large call stack and hold
> -		 * various filesystem locks, the mmap_sem etc. and we
> -		 * don't want the OOM handler to deadlock on them
> -		 * while we sit here and wait.  Store the current OOM
> -		 * context in the task_struct, then return -ENOMEM.
> -		 * At the end of the page fault handler, with the
> -		 * stack unwound, pagefault_out_of_memory() will check
> -		 * back with us by calling
> -		 * mem_cgroup_oom_synchronize(), possibly putting the
> -		 * task to sleep.
> -		 */
> -		current->memcg_oom.oom_locked = locked;
> -		current->memcg_oom.wakeups = wakeups;
> -		css_get(&memcg->css);
> -		current->memcg_oom.wait_on_memcg = memcg;
> -	}
> +	css_get(&memcg->css);
> +	current->memcg_oom.memcg = memcg;
> +	current->memcg_oom.gfp_mask = mask;
> +	current->memcg_oom.order = order;
>  }
>  
>  /**
>   * mem_cgroup_oom_synchronize - complete memcg OOM handling
> + * @handle: actually kill/wait or just clean up the OOM state
>   *
> - * This has to be called at the end of a page fault if the the memcg
> - * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
> + * This has to be called at the end of a page fault if the memcg OOM
> + * handler was enabled.
>   *
> - * Memcg supports userspace OOM handling, so failed allocations must
> + * Memcg supports userspace OOM handling where failed allocations must
>   * sleep on a waitqueue until the userspace task resolves the
>   * situation.  Sleeping directly in the charge context with all kinds
>   * of locks held is not a good idea, instead we remember an OOM state
>   * in the task and mem_cgroup_oom_synchronize() has to be called at
> - * the end of the page fault to put the task to sleep and clean up the
> - * OOM state.
> + * the end of the page fault to complete the OOM handling.
>   *
>   * Returns %true if an ongoing memcg OOM situation was detected and
> - * finalized, %false otherwise.
> + * completed, %false otherwise.
>   */
> -bool mem_cgroup_oom_synchronize(void)
> +bool mem_cgroup_oom_synchronize(bool handle)
>  {
> +	struct mem_cgroup *memcg = current->memcg_oom.memcg;
>  	struct oom_wait_info owait;
> -	struct mem_cgroup *memcg;
> +	bool locked;
>  
>  	/* OOM is global, do not handle */
> -	if (!current->memcg_oom.in_memcg_oom)
> -		return false;
> -
> -	/*
> -	 * We invoked the OOM killer but there is a chance that a kill
> -	 * did not free up any charges.  Everybody else might already
> -	 * be sleeping, so restart the fault and keep the rampage
> -	 * going until some charges are released.
> -	 */
> -	memcg = current->memcg_oom.wait_on_memcg;
>  	if (!memcg)
> -		goto out;
> +		return false;
>  
> -	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> -		goto out_memcg;
> +	if (!handle)
> +		goto cleanup;
>  
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
> @@ -2273,13 +2222,25 @@ bool mem_cgroup_oom_synchronize(void)
>  	INIT_LIST_HEAD(&owait.wait.task_list);
>  
>  	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> -	/* Only sleep if we didn't miss any wakeups since OOM */
> -	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
> +	mem_cgroup_mark_under_oom(memcg);
> +
> +	locked = mem_cgroup_oom_trylock(memcg);
> +
> +	if (locked)
> +		mem_cgroup_oom_notify(memcg);
> +
> +	if (locked && !memcg->oom_kill_disable) {
> +		mem_cgroup_unmark_under_oom(memcg);
> +		finish_wait(&memcg_oom_waitq, &owait.wait);
> +		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
> +					 current->memcg_oom.order);
> +	} else {
>  		schedule();
> -	finish_wait(&memcg_oom_waitq, &owait.wait);
> -out_memcg:
> -	mem_cgroup_unmark_under_oom(memcg);
> -	if (current->memcg_oom.oom_locked) {
> +		mem_cgroup_unmark_under_oom(memcg);
> +		finish_wait(&memcg_oom_waitq, &owait.wait);
> +	}
> +
> +	if (locked) {
>  		mem_cgroup_oom_unlock(memcg);
>  		/*
>  		 * There is no guarantee that an OOM-lock contender
> @@ -2288,10 +2249,9 @@ out_memcg:
>  		 */
>  		memcg_oom_recover(memcg);
>  	}
> +cleanup:
> +	current->memcg_oom.memcg = NULL;
>  	css_put(&memcg->css);
> -	current->memcg_oom.wait_on_memcg = NULL;
> -out:
> -	current->memcg_oom.in_memcg_oom = 0;
>  	return true;
>  }
>  
> @@ -2705,6 +2665,9 @@ static int __mem_cgroup_try_charge(struc
>  		     || fatal_signal_pending(current)))
>  		goto bypass;
>  
> +	if (unlikely(task_in_memcg_oom(current)))
> +		goto bypass;
> +
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the

First, apologies that I didn't look at all these changes earlier even 
though I was cc'd.

The memcg oom killer has incurred a serious regression since the 3.12-rc6 
kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error 
OOM situations more gracefully").  It cc'd stable@kernel.org although it 
doesn't appear to have been picked up yet, and I'm hoping that we can 
avoid having it merged in a stable kernel until we get this fixed.

This patch, specifically the above, allows memcgs to bypass their limits 
by charging the root memcg in oom conditions.

If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
memory allocator in it to induce oom, the memcg limit is trivially broken:

membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
membench cpuset=/ mems_allowed=0-3
CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
 ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
 ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
 ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
Call Trace:
 [<ffffffff8154a1e3>] dump_stack+0x46/0x58
 [<ffffffff81548b45>] dump_header+0x7a/0x1bb
 [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
 [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
 [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
 [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
 [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
 [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
 [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
 [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
 [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
 [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
 [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
 [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
 [<ffffffff81553fde>] do_page_fault+0xe/0x10
 [<ffffffff815509e2>] page_fault+0x22/0x30
Task in /cg1 killed as a result of limit of /cg1
memory: usage 131072kB, limit 131072kB, failcnt 1053
memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 7761]     0  7761     1106      483       5        0             0 bash
[11388]    99 11388   270773    33031      83        0             0 membench
Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB

The score of 1010 shown for pid 11388 (membench) should never happen in 
the oom killer; the maximum value should always be 1000 in any oom 
context.  This indicates that the process has allocated more memory than 
is available to the memcg.  The rss value, 33031 pages, shows that it has 
allocated >129MB of memory in a memcg limited to 128MB.

The entire premise of memcg is to prevent the processes attached to it 
from allocating more memory than allowed, and this trivially breaks 
that premise in oom conditions.


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-27 23:08 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
@ 2013-11-27 23:33   ` Johannes Weiner
  2013-11-28  0:56     ` David Rientjes
  2013-11-28  9:12     ` Michal Hocko
  0 siblings, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-11-27 23:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, stable, Michal Hocko, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed, Nov 27, 2013 at 03:08:30PM -0800, David Rientjes wrote:
> On Thu, 17 Oct 2013, akpm@linux-foundation.org wrote:
> 
> > [...]
> 
> First, apologies that I didn't look at all these changes earlier even 
> though I was cc'd.
> 
> The memcg oom killer has incurred a serious regression since the 3.12-rc6 
> kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error 
> OOM situations more gracefully").  It cc'd stable@kernel.org although it 
> doesn't appear to have been picked up yet, and I'm hoping that we can 
> avoid having it merged in a stable kernel until we get this fixed.
> 
> This patch, specifically the above, allows memcgs to bypass their limits 
> by charging the root memcg in oom conditions.
> 
> If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
> memory allocator in it to induce oom, the memcg limit is trivially broken:
> 
> membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> membench cpuset=/ mems_allowed=0-3
> CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
>  ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
>  ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
>  ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
> Call Trace:
>  [<ffffffff8154a1e3>] dump_stack+0x46/0x58
>  [<ffffffff81548b45>] dump_header+0x7a/0x1bb
>  [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
>  [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
>  [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
>  [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
>  [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
>  [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
>  [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
>  [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
>  [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
>  [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
>  [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
>  [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
>  [<ffffffff81553fde>] do_page_fault+0xe/0x10
>  [<ffffffff815509e2>] page_fault+0x22/0x30
> Task in /cg1 killed as a result of limit of /cg1
> memory: usage 131072kB, limit 131072kB, failcnt 1053
> memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
> [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> [ 7761]     0  7761     1106      483       5        0             0 bash
> [11388]    99 11388   270773    33031      83        0             0 membench
> Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
> Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB
> 
> The score of 1010 shown for pid 11388 (membench) should never happen in 
> the oom killer; the maximum value should always be 1000 in any oom 
> context.  This indicates that the process has allocated more memory than 
> is available to the memcg.  The rss value, 33031 pages, shows that it has 
> allocated >129MB of memory in a memcg limited to 128MB.
> 
> The entire premise of memcg is to prevent the processes attached to it 
> from allocating more memory than allowed, and this trivially breaks 
> that premise in oom conditions.

We already allow a task to allocate beyond the limit if it's selected
by the OOM killer, so that it can exit faster.

My patch added that a task can bypass the limit when it decided to
trigger the OOM killer, so that it can get to the OOM kill faster.

So I don't think my patch has broken "the entire premise of memcgs".
At the same time, it does not really rely on that bypass, so we
should be able to remove it.

This patch series was not supposed to go into the last merge window; I
already told stable to hold off on these until further notice.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13b9d0f..5f9e467 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		goto bypass;
 
 	if (unlikely(task_in_memcg_oom(current)))
-		goto bypass;
+		goto nomem;
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-27 23:33   ` Johannes Weiner
@ 2013-11-28  0:56     ` David Rientjes
  2013-11-28  2:18       ` Johannes Weiner
  2013-11-28  9:12     ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: David Rientjes @ 2013-11-28  0:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, stable, Michal Hocko, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > [...]
> 
> We already allow a task to allocate beyond the limit if it's selected
> by the OOM killer, so that it can exit faster.
> 
> My patch added that a task can bypass the limit when it decided to
> trigger the OOM killer, so that it can get to the OOM kill faster.
> 

The task that is bypassing the memcg charge to the root memcg may not be 
the process that is chosen by the oom killer, and it's possible the amount 
of memory freed by killing the victim is less than the amount of memory 
bypassed.

> So I don't think my patch has broken "the entire premise of memcgs".
> At the same time, it does not really rely on that bypass, so we
> should be able to remove it.
> 
> This patch series was not supposed to go into the last merge window; I
> already told stable to hold off on these until further notice.
> 

Were you targeting these to 3.13 instead?  If so, it would have already 
appeared in 3.13-rc1 anyway.  Is it still a work in progress?

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13b9d0f..5f9e467 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
> -		goto bypass;
> +		goto nomem;
>  
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.

Is there any benefit to doing this over just schedule_timeout_killable() 
since we need to wait for mem_cgroup_oom_synchronize() to be able to make 
forward progress at this point?

Should we be checking mem_cgroup_margin() here to ensure 
task_in_memcg_oom() is still accurate and we haven't raced by freeing 
memory?


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  0:56     ` David Rientjes
@ 2013-11-28  2:18       ` Johannes Weiner
  2013-11-28  2:38         ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-11-28  2:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, stable, Michal Hocko, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed, Nov 27, 2013 at 04:56:04PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > The memcg oom killer has incurred a serious regression since the 3.12-rc6 
> > > kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error 
> > > OOM situations more gracefully").  It cc'd stable@kernel.org although it 
> > > doesn't appear to have been picked up yet, and I'm hoping that we can 
> > > avoid having it merged in a stable kernel until we get this fixed.
> > > 
> > > This patch, specifically the above, allows memcgs to bypass their limits 
> > > by charging the root memcg in oom conditions.
> > > 
> > > If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
> > > memory allocator in it to induce oom, the memcg limit is trivially broken:
> > > 
> > > membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > > membench cpuset=/ mems_allowed=0-3
> > > CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
> > >  ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
> > >  ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
> > >  ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
> > > Call Trace:
> > >  [<ffffffff8154a1e3>] dump_stack+0x46/0x58
> > >  [<ffffffff81548b45>] dump_header+0x7a/0x1bb
> > >  [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
> > >  [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
> > >  [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
> > >  [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
> > >  [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
> > >  [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
> > >  [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
> > >  [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
> > >  [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
> > >  [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
> > >  [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
> > >  [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
> > >  [<ffffffff81553fde>] do_page_fault+0xe/0x10
> > >  [<ffffffff815509e2>] page_fault+0x22/0x30
> > > Task in /cg1 killed as a result of limit of /cg1
> > > memory: usage 131072kB, limit 131072kB, failcnt 1053
> > > memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
> > > [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> > > [ 7761]     0  7761     1106      483       5        0             0 bash
> > > [11388]    99 11388   270773    33031      83        0             0 membench
> > > Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
> > > Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB
> > > 
> > > The score of 1010 shown for pid 11388 (membench) should never happen in 
> > > the oom killer; the maximum value should always be 1000 in any oom 
> > > context.  This indicates that the process has allocated more memory than 
> > > is available to the memcg.  The rss value, 33031 pages, shows that it has 
> > > allocated >129MB of memory in a memcg limited to 128MB.
> > > 
> > > The entire premise of memcg is to prevent processes attached to it from 
> > > allocating more memory than allowed, and this trivially breaks that 
> > > premise in oom conditions.
> > 
> > We already allow a task to allocate beyond the limit if it's selected
> > by the OOM killer, so that it can exit faster.
> > 
> > My patch added that a task can bypass the limit when it decided to
> > trigger the OOM killer, so that it can get to the OOM kill faster.
> > 
> 
> The task that is bypassing the memcg charge to the root memcg may not be 
> the process that is chosen by the oom killer, and it's possible the amount 
> of memory freed by killing the victim is less than the amount of memory 
> bypassed.

That's true, though unlikely.

> > So I don't think my patch has broken "the entire premise of memcgs".
> > At the same time, it also does not really rely on that bypass, we
> > should be able to remove it.
> > 
> > This patch series was not supposed to go into the last merge window, I
> > already told stable to hold off on these until further notice.
> > 
> 
> Were you targeting these to 3.13 instead?  If so, it would have already 
> appeared in 3.13-rc1 anyway.  Is it still a work in progress?

I don't know how to answer this question.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 13b9d0f..5f9e467 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		goto bypass;
> >  
> >  	if (unlikely(task_in_memcg_oom(current)))
> > -		goto bypass;
> > +		goto nomem;
> >  
> >  	/*
> >  	 * We always charge the cgroup the mm_struct belongs to.
> 
> Is there any benefit to doing this over just schedule_timeout_killable() 
> since we need to wait for mem_cgroup_oom_synchronize() to be able to make 
> forward progress at this point?

This does not make any sense: current is the process that will execute
mem_cgroup_oom_synchronize().  current entered OOM, and it will kill once
the fault handler returns.  This check is there to quickly end any
subsequent attempts of the fault handler to allocate memory after a
failed allocation and to reach the OOM killer as soon as possible.

> Should we be checking mem_cgroup_margin() here to ensure 
> task_in_memcg_oom() is still accurate and we haven't raced by freeing 
> memory?

We would have invoked the OOM killer long before this point prior to
my patches.  There is a line we draw and from that point on we start
killing things.  I tried to explain multiple times now that there is
no race-free OOM killing and I'm tired of it.  Convince me otherwise
or stop repeating this nonsense.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  2:18       ` Johannes Weiner
@ 2013-11-28  2:38         ` David Rientjes
  2013-11-28  3:13           ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2013-11-28  2:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, stable, Michal Hocko, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > The task that is bypassing the memcg charge to the root memcg may not be 
> > the process that is chosen by the oom killer, and it's possible the amount 
> > of memory freed by killing the victim is less than the amount of memory 
> > bypassed.
> 
> That's true, though unlikely.
> 

Well, the "goto bypass" allows it and it's trivial to cause by 
manipulating /proc/pid/oom_score_adj values to prefer processes with very 
little rss.  It will just continue looping and killing processes as they 
are forked and never cause the memcg to free memory below its limit.  At 
least the "goto nomem" allows us to free some memory instead of leaking to 
the root memcg.

> > Were you targeting these to 3.13 instead?  If so, it would have already 
> > appeared in 3.13-rc1 anyway.  Is it still a work in progress?
> 
> I don't know how to answer this question.
> 

It appears as though this work is being developed in Linus's tree rather 
than -mm, so I'm asking if we should consider backing some of it out for 
3.14 instead.

> > Should we be checking mem_cgroup_margin() here to ensure 
> > task_in_memcg_oom() is still accurate and we haven't raced by freeing 
> > memory?
> 
> We would have invoked the OOM killer long before this point prior to
> my patches.  There is a line we draw and from that point on we start
> killing things.  I tried to explain multiple times now that there is
> no race-free OOM killing and I'm tired of it.  Convince me otherwise
> or stop repeating this nonsense.
> 

In our internal kernel we call mem_cgroup_margin() with the order of the 
charge immediately prior to sending the SIGKILL to see if it's still 
needed even after selecting the victim.  It makes the race smaller.

It's obvious that after the SIGKILL is sent, either from the kernel or 
from userspace, that memory might subsequently be freed or another process 
might exit before the process killed could even wake up.  There's nothing 
we can do about that since we don't have psychic abilities.  I think we 
should try to reduce the chance for unnecessary oom killing as much as 
possible, however.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  2:38         ` David Rientjes
@ 2013-11-28  3:13           ` Johannes Weiner
  2013-11-28  3:20             ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-11-28  3:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, stable, Michal Hocko, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed, Nov 27, 2013 at 06:38:31PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > The task that is bypassing the memcg charge to the root memcg may not be 
> > > the process that is chosen by the oom killer, and it's possible the amount 
> > > of memory freed by killing the victim is less than the amount of memory 
> > > bypassed.
> > 
> > That's true, though unlikely.
> > 
> 
> Well, the "goto bypass" allows it and it's trivial to cause by 
> manipulating /proc/pid/oom_score_adj values to prefer processes with very 
> little rss.  It will just continue looping and killing processes as they 
> are forked and never cause the memcg to free memory below its limit.  At 
> least the "goto nomem" allows us to free some memory instead of leaking to 
> the root memcg.

Yes, that's the better way of doing it, I'll send the patch.  Thanks.

> > > Were you targeting these to 3.13 instead?  If so, it would have already 
> > > appeared in 3.13-rc1 anyway.  Is it still a work in progress?
> > 
> > I don't know how to answer this question.
> > 
> 
> It appears as though this work is being developed in Linus's tree rather 
> than -mm, so I'm asking if we should consider backing some of it out for 
> 3.14 instead.

The changes fix a deadlock problem.  Are they creating problems that
are worse than deadlocks, that would justify their revert?

> > > Should we be checking mem_cgroup_margin() here to ensure 
> > > task_in_memcg_oom() is still accurate and we haven't raced by freeing 
> > > memory?
> > 
> > We would have invoked the OOM killer long before this point prior to
> > my patches.  There is a line we draw and from that point on we start
> > killing things.  I tried to explain multiple times now that there is
> > no race-free OOM killing and I'm tired of it.  Convince me otherwise
> > or stop repeating this nonsense.
> > 
> 
> In our internal kernel we call mem_cgroup_margin() with the order of the 
> charge immediately prior to sending the SIGKILL to see if it's still 
> needed even after selecting the victim.  It makes the race smaller.
> 
> It's obvious that after the SIGKILL is sent, either from the kernel or 
> from userspace, that memory might subsequently be freed or another process 
> might exit before the process killed could even wake up.  There's nothing 
> we can do about that since we don't have psychic abilities.  I think we 
> should try to reduce the chance for unnecessary oom killing as much as 
> possible, however.

Since we can't physically draw a perfect line, we should strive for a
reasonable and intuitive line.  After that it's rapidly diminishing
returns.  Killing something after that much reclaim effort without
success is a completely reasonable and intuitive line to draw.  It's
also the line that has been drawn a long time ago and we're not
breaking this because of a micro-optimization.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  3:13           ` Johannes Weiner
@ 2013-11-28  3:20             ` David Rientjes
  2013-11-28  3:52               ` Johannes Weiner
  2013-11-28 10:02               ` Michal Hocko
  0 siblings, 2 replies; 21+ messages in thread
From: David Rientjes @ 2013-11-28  3:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, linux-kernel, linux-mm

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > It appears as though this work is being developed in Linus's tree rather 
> > than -mm, so I'm asking if we should consider backing some of it out for 
> > 3.14 instead.
> 
> The changes fix a deadlock problem.  Are they creating problems that
> are worse than deadlocks, that would justify their revert?
> 

None that I am currently aware of, I'll continue to try them out.  I'd 
suggest just dropping the stable@kernel.org from the whole series though 
unless there is another report of such a problem that people are running 
into.

> Since we can't physically draw a perfect line, we should strive for a
> reasonable and intuitive line.  After that it's rapidly diminishing
> returns.  Killing something after that much reclaim effort without
> success is a completely reasonable and intuitive line to draw.  It's
> also the line that has been drawn a long time ago and we're not
> breaking this because of a micro-optimization.
> 

You don't think something like this is helpful after scanning a memcg with 
a large number of processes?

We've had this patch internally since we started using memcg, it has 
avoided some unnecessary oom killing.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!chosen)
 		return;
 	points = chosen_points * 1000 / totalpages;
+
+	/* One last chance to see if we really need to kill something */
+	if (mem_cgroup_margin(memcg) >= (1 << order)) {
+		put_task_struct(chosen);
+		return;
+	}
+
 	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
 			 NULL, "Memory cgroup out of memory");
 }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  3:20             ` David Rientjes
@ 2013-11-28  3:52               ` Johannes Weiner
  2013-11-30  0:00                 ` David Rientjes
  2013-11-28 10:02               ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-11-28  3:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, linux-kernel, linux-mm

On Wed, Nov 27, 2013 at 07:20:37PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > It appears as though this work is being developed in Linus's tree rather 
> > > than -mm, so I'm asking if we should consider backing some of it out for 
> > > 3.14 instead.
> > 
> > The changes fix a deadlock problem.  Are they creating problems that
> > are worse than deadlocks, that would justify their revert?
> > 
> 
> None that I am currently aware of, I'll continue to try them out.  I'd 
> suggest just dropping the stable@kernel.org from the whole series though 
> unless there is another report of such a problem that people are running 
> into.

The series has long been merged, how do we drop stable@kernel.org from
it?

> > Since we can't physically draw a perfect line, we should strive for a
> > reasonable and intuitive line.  After that it's rapidly diminishing
> > returns.  Killing something after that much reclaim effort without
> > success is a completely reasonable and intuitive line to draw.  It's
> > also the line that has been drawn a long time ago and we're not
> > breaking this because of a micro-optimization.
> > 
> 
> You don't think something like this is helpful after scanning a memcg with 
> a large number of processes?
> 
> We've had this patch internally since we started using memcg, it has 
> avoided some unnecessary oom killing.

Do you have quantified data that OOM kills are reduced over a longer
sampling period?  How many kills are skipped?  How many of them are
deferred temporarily but the VM ended up having to kill something
anyway?  My theory still being that several loops of failed direct
reclaim and charge attempts likely say more about the machine state
than somebody randomly releasing some memory in the last minute...

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!chosen)
>  		return;
>  	points = chosen_points * 1000 / totalpages;
> +
> +	/* One last chance to see if we really need to kill something */
> +	if (mem_cgroup_margin(memcg) >= (1 << order)) {
> +		put_task_struct(chosen);
> +		return;
> +	}
> +
>  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
>  			 NULL, "Memory cgroup out of memory");
>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-27 23:33   ` Johannes Weiner
  2013-11-28  0:56     ` David Rientjes
@ 2013-11-28  9:12     ` Michal Hocko
  2013-11-30  3:37       ` Johannes Weiner
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-11-28  9:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, stable, azurit, mm-commits,
	linux-kernel, linux-mm

On Wed 27-11-13 18:33:53, Johannes Weiner wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13b9d0f..5f9e467 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		goto bypass;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
> -		goto bypass;
> +		goto nomem;
>  
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.

Yes, I think we really want this. Plan to send a patch? The first charge
failure due to OOM shouldn't be papered over by a later attempt if we
didn't get through mem_cgroup_oom_synchronize yet.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  3:20             ` David Rientjes
  2013-11-28  3:52               ` Johannes Weiner
@ 2013-11-28 10:02               ` Michal Hocko
  2013-11-30  0:05                 ` David Rientjes
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-11-28 10:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, azurit, mm-commits, linux-kernel,
	linux-mm

On Wed 27-11-13 19:20:37, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > It appears as though this work is being developed in Linus's tree rather 
> > > than -mm, so I'm asking if we should consider backing some of it out for 
> > > 3.14 instead.
> > 
> > The changes fix a deadlock problem.  Are they creating problems that
> > are worse than deadlocks, that would justify their revert?
> > 
> 
> None that I am currently aware of,

Are you saying that scenarios described in 3812c8c8f395 (mm: memcg: do not
trap chargers with full callstack on OOM) are not real, or that _you_
haven't seen an issue like that?

The latter doesn't seem to be so relevant, as we had at least one user who
has seen those in real life.

> I'll continue to try them out. 

> I'd suggest just dropping the stable@kernel.org from the whole series
> though unless there is another report of such a problem that people
> are running into.

The stable backport is another question though. Although the bug has been
there for ages and the rework by Johannes is definitely a step forward,
I would be careful about pushing this into stable because the rework brings a
user-visible behavior change which might be unexpected (especially in
the middle of the stable life cycle). Seeing allocation failures instead
of OOM for charges outside of page fault context is surely a big change
in semantics.

Take mmap(MAP_POPULATE) as an example. Previously we would OOM if the
limit was reached. Now the syscall returns without any error code but a
later access might block on the OOM. While I do not see this as a
problem in general, as this is consistent with the !memcg case, I wouldn't
like to see a breakage of the previous expectation in the middle of the
stable life cycle.

> > Since we can't physically draw a perfect line, we should strive for a
> > reasonable and intuitive line.  After that it's rapidly diminishing
> > returns.  Killing something after that much reclaim effort without
> > success is a completely reasonable and intuitive line to draw.  It's
> > also the line that has been drawn a long time ago and we're not
> > breaking this because of a micro-optimization.
> > 
> 
> You don't think something like this is helpful after scanning a memcg with 
> a large number of processes?

It looks like a one-shot workaround for short-lived processes to me.
I agree with Johannes that it doesn't seem to be worth it. The race will
always be there. Moreover, why should the memcg OOM killer behave
differently from the global OOM?

> We've had this patch internally since we started using memcg, it has 
> avoided some unnecessary oom killing.
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!chosen)
>  		return;
>  	points = chosen_points * 1000 / totalpages;
> +
> +	/* One last chance to see if we really need to kill something */
> +	if (mem_cgroup_margin(memcg) >= (1 << order)) {
> +		put_task_struct(chosen);
> +		return;
> +	}
> +
>  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
>  			 NULL, "Memory cgroup out of memory");
>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  3:52               ` Johannes Weiner
@ 2013-11-30  0:00                 ` David Rientjes
  2013-11-30  0:51                   ` Greg KH
  2013-11-30  3:35                   ` Johannes Weiner
  0 siblings, 2 replies; 21+ messages in thread
From: David Rientjes @ 2013-11-30  0:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > None that I am currently aware of, I'll continue to try them out.  I'd 
> > suggest just dropping the stable@kernel.org from the whole series though 
> > unless there is another report of such a problem that people are running 
> > into.
> 
> The series has long been merged, how do we drop stable@kernel.org from
> it?
> 

You said you have informed stable not to merge these patches until further 
notice; I'd suggest simply never merging the whole series into a 
stable kernel since the problem isn't serious enough.  Changes that do 
"goto nomem" seem fine to mark for stable, though.

> > We've had this patch internally since we started using memcg, it has 
> > avoided some unnecessary oom killing.
> 
> Do you have quantified data that OOM kills are reduced over a longer
> sampling period?  How many kills are skipped?  How many of them are
> deferred temporarily but the VM ended up having to kill something
> anyway?

On the scale that we run memcg, we would see it daily in automated testing 
primarily because we panic the machine for memcg oom conditions where 
there are no killable processes.  It would typically manifest with two 
processes that are allocating memory in a memcg: one is oom killed, is 
allowed to allocate, handles its SIGKILL, exits and frees its memory, and 
the second process, which is oom disabled, races with the uncharge, so 
the machine panics.

The upstream kernel of course doesn't panic in such a condition but if the 
same scenario were to have happened, the second process would be 
unnecessarily oom killed because it raced with the uncharge of the first 
victim and it had exited before the scan of processes in the memcg oom 
killer could detect it and defer.  So this patch definitely does prevent 
unnecessary oom killing when run at such a large scale that we do.

I'll send a formal patch.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (!chosen)
> >  		return;
> >  	points = chosen_points * 1000 / totalpages;
> > +
> > +	/* One last chance to see if we really need to kill something */
> > +	if (mem_cgroup_margin(memcg) >= (1 << order)) {
> > +		put_task_struct(chosen);
> > +		return;
> > +	}
> > +
> >  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
> >  			 NULL, "Memory cgroup out of memory");
> >  }
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28 10:02               ` Michal Hocko
@ 2013-11-30  0:05                 ` David Rientjes
  2013-12-02 13:12                   ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2013-11-30  0:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, azurit, mm-commits, linux-kernel,
	linux-mm

On Thu, 28 Nov 2013, Michal Hocko wrote:

> > None that I am currently aware of,
> 
> Are you saing that scenarios described in 3812c8c8f395 (mm: memcg: do not
> trap chargers with full callstack on OOM) are not real or that _you_
> haven't seen an issue like that?
> 
> The later doesn't seem to be so relevant as we had at least one user who
> has seen those in the real life.
> 

I said I'm not currently aware of any additional problems with the 
patchset, but since Johannes said the entire series wasn't meant for that 
merge window, I asked if it was still being worked on.

> > You don't think something like this is helpful after scanning a memcg will 
> > a large number of processes?
> 
> It looks as a one-shot workaround for short lived processes to me.

It has nothing to do with how long a process has been running; both racing 
processes could have been running for years.  It's obvious that even this 
patch, before calling oom_kill_process(), does not catch a racing process 
that has already freed its memory and is exiting, but it makes the 
likelihood significantly lower in testing at scale.  It's simply better to 
avoid unnecessary oom killing whenever possible, and this is not a 
hotpath.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:00                 ` David Rientjes
@ 2013-11-30  0:51                   ` Greg KH
  2013-11-30 10:25                     ` David Rientjes
  2013-11-30  3:35                   ` Johannes Weiner
  1 sibling, 1 reply; 21+ messages in thread
From: Greg KH @ 2013-11-30  0:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
	stable, linux-kernel, linux-mm

On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > suggest just dropping the stable@kernel.org from the whole series though 
> > > unless there is another report of such a problem that people are running 
> > > into.
> > 
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> > 
> 
> You said you have informed stable not to merge these patches until further 
> notice; I'd suggest simply never merging the whole series into a 
> stable kernel since the problem isn't serious enough.  Changes that do 
> "goto nomem" seem fine to mark for stable, though.

I'm lost.  These patches are in 3.12, so how can they not be "in
stable"?

What exactly do you want me to do here?

totally confused,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:00                 ` David Rientjes
  2013-11-30  0:51                   ` Greg KH
@ 2013-11-30  3:35                   ` Johannes Weiner
  2013-11-30 10:32                     ` David Rientjes
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-11-30  3:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > suggest just dropping the stable@kernel.org from the whole series though 
> > > unless there is another report of such a problem that people are running 
> > > into.
> > 
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> > 
> 
> You said you have informed stable not to merge these patches until further 
> notice; I'd suggest simply never merging the whole series into a 
> stable kernel since the problem isn't serious enough.  Changes that do 
> "goto nomem" seem fine to mark for stable, though.

These are followup fixes for the series that is upstream but didn't go
to stable.  I truly have no idea what you are talking about.

> > > We've had this patch internally since we started using memcg, it has 
> > > avoided some unnecessary oom killing.
> > 
> > Do you have quantified data that OOM kills are reduced over a longer
> > sampling period?  How many kills are skipped?  How many of them are
> > deferred temporarily but the VM ended up having to kill something
> > anyway?
> 
> On the scale that we run memcg, we would see it daily in automated testing 
> primarily because we panic the machine for memcg oom conditions where 
> there are no killable processes.  It would typically manifest by two 
> processes that are allocating memory in a memcg; one is oom killed, is 
> allowed to allocate, handles its SIGKILL, exits and frees its memory and 
> the second process which is oom disabled races with the uncharge and is 
> oom disabled so the machine panics.

So why don't you implement proper synchronization instead of putting
these random checks all over the map to make the race window just
small enough to not matter most of the time?

> The upstream kernel of course doesn't panic in such a condition but if the 
> same scenario were to have happened, the second process would be 
> unnecessarily oom killed because it raced with the uncharge of the first 
> victim and it had exited before the scan of processes in the memcg oom 
> killer could detect it and defer.  So this patch definitely does prevent 
> unnecessary oom killing when run at such a large scale that we do.

If you are really bothered by this race, then please have OOM kill
invocations wait for any outstanding TIF_MEMDIE tasks in the same
context.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-28  9:12     ` Michal Hocko
@ 2013-11-30  3:37       ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-11-30  3:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, stable, azurit, mm-commits,
	linux-kernel, linux-mm

On Thu, Nov 28, 2013 at 10:12:55AM +0100, Michal Hocko wrote:
> On Wed 27-11-13 18:33:53, Johannes Weiner wrote:
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 13b9d0f..5f9e467 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		goto bypass;
> >  
> >  	if (unlikely(task_in_memcg_oom(current)))
> > -		goto bypass;
> > +		goto nomem;
> >  
> >  	/*
> >  	 * We always charge the cgroup the mm_struct belongs to.
> 
> Yes, I think we really want this. Plan to send a patch? The first charge
> failure due to OOM shouldn't be papered over by a later attempt if we
> didn't get through mem_cgroup_oom_synchronize yet.

Sure thing.  Will send this to Andrew on Monday.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:51                   ` Greg KH
@ 2013-11-30 10:25                     ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2013-11-30 10:25 UTC (permalink / raw)
  To: Greg KH
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
	stable, linux-kernel, linux-mm

On Fri, 29 Nov 2013, Greg KH wrote:

> > > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > > suggest just dropping the stable@kernel.org from the whole series though 
> > > > unless there is another report of such a problem that people are running 
> > > > into.
> > > 
> > > The series has long been merged, how do we drop stable@kernel.org from
> > > it?
> > > 
> > 
> > You said you have informed stable not to merge these patches until further 
> > notice; I'd suggest simply never merging the whole series into a 
> > stable kernel since the problem isn't serious enough.  Changes that do 
> > "goto nomem" seem fine to mark for stable, though.
> 
> I'm lost.  These patches are in 3.12, so how can they not be "in
> stable"?
> 
> What exactly do you want me to do here?
> 

Sorry, I sympathize with your confusion since the handling of this patch 
series has been strange and confusing from the beginning.

I'm referring to the comment in this thread from Johannes: "[t]his patch 
series was not supposed to go into the last merge window, I already told 
stable to hold off on these until further notice" from 
http://marc.info/?l=linux-mm&m=138559524422298 and his intention to send 
the entire series to stable in 
http://marc.info/?l=linux-kernel&m=138539243412073.

From that, I had thought you were already aware of this series and were 
waiting to merge it into previous stable kernels; I'm suggesting that 
stable doesn't touch this series with a ten-foot pole and only backport 
the fixes that have gone into 3.12.

That series is:

759496ba640c ("arch: mm: pass userspace fault flag to generic fault handler")
3a13c4d761b4 ("x86: finish user fault error path with fatal signal")
519e52473ebe ("mm: memcg: enable memcg OOM killer only for user faults")
fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and wakeup")
3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")

And then there's the mystery of 4942642080ea ("mm: memcg: handle non-error 
OOM situations more gracefully").  This patch went into 3.12-rc6 and is 
marked for stable@vger.kernel.org.  Its changelog indicates it fixes the 
last patch in the above series, but that patch isn't marked for 
stable@vger.kernel.org.

And then you have 84235de394d9 ("fs: buffer: move allocation failure loop 
into the allocator") which is marked for stable@vger.kernel.org, but 
3168ecbe1c04 ("mm: memcg: use proper memcg in limit bypass") which fixes 
the memory corruption in that commit isn't marked for stable.

I disagreed with the entire series being marked for stable in 
http://marc.info/?l=linux-kernel&m=137107020216528 since it violates the 
rules in Documentation/stable_kernel_rules.txt when it was originally 
proposed for stable.  The memcg oom waitqueue that this series avoids has 
been in memcg for 3 1/2 years, since 2.6.34, and to date only one user, 
cc'd, has reported any issues with it.

And then Johannes commented that he is asking stable to hold off on the 
series until further notice.  I'm suggesting the series not be merged into 
stable at all, and that's what's in the email you're responding to.

Further, since this is confusing enough as it is, I suggest that if you do 
take 84235de394d9 ("fs: buffer: move allocation failure loop into the 
allocator") you certainly also consider 3168ecbe1c04 ("mm: 
memcg: use proper memcg in limit bypass").  The former was backported to 
3.5 and they required an emergency release of 3.5.7.25 to take it back 
out.

There's also another patch that Johannes will shortly be sending to fix 
the leakage to the root memcg in oom conditions that can trivially cause 
large amounts of memory to be charged to the root memcg when it should 
have been isolated to the oom memcg.


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  3:35                   ` Johannes Weiner
@ 2013-11-30 10:32                     ` David Rientjes
  2013-11-30 15:55                       ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2013-11-30 10:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Fri, 29 Nov 2013, Johannes Weiner wrote:

> > You said you have informed stable to not merge these patches until further 
> > notice; I'd suggest simply never merging the whole series into a 
> > stable kernel since the problem isn't serious enough.  Changes that 
> > do "goto nomem" seem fine to mark for stable, though.
> 
> These are followup fixes for the series that is upstream but didn't go
> to stable.  I truly have no idea what you are talking about.
> 

I'm responding to your comments[*] that indicate you were going to 
eventually be sending it to stable.

> > On the scale that we run memcg, we would see it daily in automated testing 
> > primarily because we panic the machine for memcg oom conditions where 
> > there are no killable processes.  It would typically manifest by two 
> > processes that are allocating memory in a memcg; one is oom killed, is 
> > allowed to allocate, handles its SIGKILL, exits and frees its memory, and 
> > the second process, which is oom disabled, races with the uncharge, so 
> > the machine panics.
> 
> So why don't you implement proper synchronization instead of putting
> these random checks all over the map to make the race window just
> small enough to not matter most of the time?
> 

The oom killer can be quite expensive, so we have found it advantageous, 
after doing all that work, to check that the memcg is still oom for the 
charge order before needlessly killing a process.  I am not suggesting 
that we add synchronization to the uncharge path for such a race, but 
merely a simple check, as illustrated, as due diligence.  I think a simple 
conditional in the oom killer to avoid needlessly killing a user job is 
beneficial and avoids questions from customers who have a kernel log 
showing an oom kill occurring in a memcg that is not oom.  We could even 
do the check in oom_kill_process() after dump_header() if you want to 
reduce any chance of that and avoid getting bug reports about such cases.

> If you are really bothered by this race, then please have OOM kill
> invocations wait for any outstanding TIF_MEMDIE tasks in the same
> context.
> 

The oom killer requires a tasklist scan, or an iteration over the set of 
processes attached to the memcg for the memcg case, to find a victim.  It 
already defers if it finds eligible threads with TIF_MEMDIE set.

Thanks.

[*] http://marc.info/?l=linux-mm&m=138559524422298
    http://marc.info/?l=linux-kernel&m=138539243412073


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30 10:32                     ` David Rientjes
@ 2013-11-30 15:55                       ` Johannes Weiner
  2013-11-30 22:12                         ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-11-30 15:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Sat, Nov 30, 2013 at 02:32:43AM -0800, David Rientjes wrote:
> On Fri, 29 Nov 2013, Johannes Weiner wrote:
> 
> > > You said you have informed stable to not merge these patches until further 
> > > notice; I'd suggest simply never merging the whole series into a 
> > > stable kernel since the problem isn't serious enough.  Changes that 
> > > do "goto nomem" seem fine to mark for stable, though.
> > 
> > These are followup fixes for the series that is upstream but didn't go
> > to stable.  I truly have no idea what you are talking about.
> > 
> 
> I'm responding to your comments[*] that indicate you were going to 
> eventually be sending it to stable.
> 
> > > On the scale that we run memcg, we would see it daily in automated testing 
> > > primarily because we panic the machine for memcg oom conditions where 
> > > there are no killable processes.  It would typically manifest by two 
> > > processes that are allocating memory in a memcg; one is oom killed, is 
> > > allowed to allocate, handles its SIGKILL, exits and frees its memory, and 
> > > the second process, which is oom disabled, races with the uncharge, so 
> > > the machine panics.
> > 
> > So why don't you implement proper synchronization instead of putting
> > these random checks all over the map to make the race window just
> > small enough to not matter most of the time?
> > 
> 
> The oom killer can be quite expensive, so we have found it advantageous, 
> after doing all that work, to check that the memcg is still oom for the 
> charge order before needlessly killing a process.  I am not suggesting 
> that we add synchronization to the uncharge path for such a race, but 
> merely a simple check, as illustrated, as due diligence.  I think a simple 
> conditional in the oom killer to avoid needlessly killing a user job is 
> beneficial and avoids questions from customers who have a kernel log 
> showing an oom kill occurring in a memcg that is not oom.  We could even 
> do the check in oom_kill_process() after dump_header() if you want to 
> reduce any chance of that and avoid getting bug reports about such cases.

I asked about quantified data of this last-minute check, you replied
with a race condition between an OOM kill victim and a subsequent OOM
kill invocation.

> > If you are really bothered by this race, then please have OOM kill
> > invocations wait for any outstanding TIF_MEMDIE tasks in the same
> > context.
> > 
> 
> The oom killer requires a tasklist scan, or an iteration over the set of 
> processes attached to the memcg for the memcg case, to find a victim.  It 
> already defers if it finds eligible threads with TIF_MEMDIE set.

And now you say that this race does not really exist and repeat the
same ramblings about last-minute checks to avoid unnecessary kills
again.  And again without any supporting data that I already asked
for.

The more I talk to you, the less sense this all makes.  Why do you
insist we merge this patch when you have apparently no idea why and
how it works, and can't demonstrate that it works in the first place?

I only followed you around in circles because I'm afraid that my
shutting up would be interpreted as agreement again and Andrew would
merge this anyway.  But this is unsustainable, the burden of proof
should be on you, not me.  I'm going to stop replying until you
provide the information I asked for.


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30 15:55                       ` Johannes Weiner
@ 2013-11-30 22:12                         ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2013-11-30 22:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Sat, 30 Nov 2013, Johannes Weiner wrote:

> > The oom killer requires a tasklist scan, or an iteration over the set of 
> > processes attached to the memcg for the memcg case, to find a victim.  It 
> > already defers if it finds eligible threads with TIF_MEMDIE set.
> 
> And now you say that this race does not really exist and repeat the
> same ramblings about last-minute checks to avoid unnecessary kills
> again.  And again without any supporting data that I already asked
> for.
> 

The race does exist; perhaps you don't understand what the race is.  This 
race occurs when process (A) declares oom and enters the oom killer, 
meanwhile an already oom killed victim (B) frees its memory and exits, and 
the process (A) oom kills another process even though the memcg is below 
its limit because of process (B).

When doing something expensive in the kernel like oom killing, it usually 
doesn't cause so much hassle when the suggestion is:

	<declare an action is necessary>
	<do something expensive>
	<select an action>
	if (!<action is still necessary>)
		abort
	<perform the action>

That type of check is fairly straightforward and makes sense.  It 
prevents unnecessary oom killing (although it can't guarantee it in all 
conditions) and prevents customers from reporting oom kills when the log 
shows there is memory available for their memcg.

When using memcg on a large scale to enforce memory isolation for user 
jobs, these types of scenarios happen often and there is no downside to 
adding such a check.  The oom killer is not a hotpath, it's not 
performance sensitive to the degree that we cannot add a simple 
conditional that checks the current limit, it prevents unnecessary oom 
kills, and prevents user confusion.

Without more invasive synchronization that would touch hotpaths, this is 
the best we can do: check if the oom kill is really necessary just before 
issuing the kill.  Having the kernel actually kill a user process is a 
serious matter and we should strive to ensure it is prevented whenever 
possible.

> The more I talk to you, the less sense this all makes.  Why do you
> insist we merge this patch when you have apparently no idea why and
> how it works, and can't demonstrate that it works in the first place?
> 

I'm not insisting anything, I don't make demands of others or maintainers 
like you do to merge or not merge anything.  I also haven't even formally 
proposed the patch with a changelog that would explain the motivation.

> I only followed you around in circles because I'm afraid that my
> shutting up would be interpreted as agreement again and Andrew would
> merge this anyway.  But this is unsustainable, the burden of proof
> should be on you, not me.  I'm going to stop replying until you
> provide the information I asked for.
> 

Andrew can't merge a patch that hasn't been proposed for merge.

Have a nice weekend.


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:05                 ` David Rientjes
@ 2013-12-02 13:12                   ` Michal Hocko
  2013-12-02 22:51                     ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-12-02 13:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, azurit, mm-commits, linux-kernel,
	linux-mm

On Fri 29-11-13 16:05:04, David Rientjes wrote:
> On Thu, 28 Nov 2013, Michal Hocko wrote:
> 
> > > None that I am currently aware of,
> > 
> > Are you saying that scenarios described in 3812c8c8f395 (mm: memcg: do not
> > trap chargers with full callstack on OOM) are not real or that _you_
> > haven't seen an issue like that?
> > 
> > The later doesn't seem to be so relevant as we had at least one user who
> > has seen those in the real life.
> > 
> 
> I said I'm not currently aware of any additional problems with the 
> patchset,

I have obviously misread your reply. Sorry about that.

> but since Johannes said the entire series wasn't meant for that 
> merge window, I asked if it was still being worked on.
> 
> > > You don't think something like this is helpful after scanning a memcg will 
> > > a large number of processes?
> > 
> > It looks as a one-shot workaround for short lived processes to me.
> 
> It has nothing to do with how long a process has been running, both racing 
> processes could have been running for years.  It's obvious that even this 
> patch before calling oom_kill_process() does not catch a racing process 
> that has already freed its memory and is exiting but it makes the 
> likelihood significantly less in testing at scale. 

I guess we need to know how much is significantly less.
oom_scan_process_thread already aborts on exiting tasks, so we do not
kill anything, and the charge (the whole page fault, actually) is then
retried when we check for the OOM again; my intuition would say that we
gave the exiting task quite a lot of time.

> It's simply better to avoid unnecessary oom killing whenever 
> possible, and this is not a hotpath.

-- 
Michal Hocko
SUSE Labs


* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-12-02 13:12                   ` Michal Hocko
@ 2013-12-02 22:51                     ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2013-12-02 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, azurit, mm-commits, linux-kernel,
	linux-mm

On Mon, 2 Dec 2013, Michal Hocko wrote:

> I guess we need to know how much is significantly less.
> oom_scan_process_thread already aborts on exiting tasks so we do not
> kill anything and then the charge (whole page fault actually) is retried
> when we check for the OOM again so my intuition would say that we gave
> the exiting task quite a lot of time.
> 

That isn't the race, though.  The race occurs when the oom killed process 
exits prior to the process iteration so it's not detected and yet its 
memory has already been freed and the memcg is no longer oom.  In other 
words, a process that has called mem_cgroup_oom_synchronize() at the same 
time that an oom killed process has freed its memory.  The result is an 
unnecessary oom killing and erroneous spam in the kernel log.

We all agree that this race cannot be completely closed (at least without 
synchronization in the uncharge path that we obviously don't want to add).  
We don't know if an oom killed process, or any process, will free its 
memory immediately after the kernel sends the SIGKILL.  However, there's 
absolutely no reason to not have a final check immediately before sending 
the SIGKILL to prevent that unnecessary oom kill.

I'm going to send the patch for review.


end of thread, other threads:[~2013-12-02 22:51 UTC | newest]

Thread overview: 21+ messages
     [not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
2013-11-27 23:08 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
2013-11-27 23:33   ` Johannes Weiner
2013-11-28  0:56     ` David Rientjes
2013-11-28  2:18       ` Johannes Weiner
2013-11-28  2:38         ` David Rientjes
2013-11-28  3:13           ` Johannes Weiner
2013-11-28  3:20             ` David Rientjes
2013-11-28  3:52               ` Johannes Weiner
2013-11-30  0:00                 ` David Rientjes
2013-11-30  0:51                   ` Greg KH
2013-11-30 10:25                     ` David Rientjes
2013-11-30  3:35                   ` Johannes Weiner
2013-11-30 10:32                     ` David Rientjes
2013-11-30 15:55                       ` Johannes Weiner
2013-11-30 22:12                         ` David Rientjes
2013-11-28 10:02               ` Michal Hocko
2013-11-30  0:05                 ` David Rientjes
2013-12-02 13:12                   ` Michal Hocko
2013-12-02 22:51                     ` David Rientjes
2013-11-28  9:12     ` Michal Hocko
2013-11-30  3:37       ` Johannes Weiner
