From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Li Zefan <lizf@cn.fujitsu.com>, Paul Menage <menage@google.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
Date: Fri, 6 Nov 2009 16:49:34 +0900
Message-ID: <20091106164934.b34d342f.nishimura@mxp.nes.nec.co.jp>
In-Reply-To: <20091106145459.351b407f.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, 6 Nov 2009 14:54:59 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 6 Nov 2009 14:11:49 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
> > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
> > during select_bad_process().
> > 
> > task_in_mem_cgroup(), which can be called by select_bad_process(), will check
> > whether a task is in the mem_cgroup or not by dereferencing task->cgroups
> > ->subsys[]. So, it would be desirable to change task->cgroups
> > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.
> > 
> > Now that we can define cancel_attach(), we can safely release memcg_tasklist
> > on fail path even if we hold memcg_tasklist in can_attach(). So let's move
> > mutex_lock/unlock() of memcg_tasklist.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  mm/memcontrol.c |   22 ++++++++++++++++++++--
> >  1 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4bd3451..d3b2ac0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  	return ret;
> >  }
> >  
> > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +	mutex_lock(&memcg_tasklist);
> > +	return 0;
> > +}
> 
> Hmm...Is this lock really necessary ?
> IOW, can't we just remove memcg_tasklist mutex ?
> What kind of bad race happens when we remove this ?
> 
It was introduced in commit 7f4d454d, where I added this mutex in place of
cgroup_mutex to fix a deadlock problem.

commit 7f4d454dee2e0bdd21bafd413d1c53e443a26540
Author: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Date:   Wed Jan 7 18:08:29 2009 -0800

    memcg: avoid deadlock caused by race between oom and cpuset_attach

    mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem).  This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually calls
    mem_cgroup_out_of_memory().  And mem_cgroup_out_of_memory() calls
    cgroup_lock().  This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, deadlock can happen.

    This patch avoid this deadlock by:
      - remove cgroup_lock() from mem_cgroup_out_of_memory().
      - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task()
        (->attach handler of memory cgroup) and mem_cgroup_out_of_memory.

    Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Before that commit, mem_cgroup_out_of_memory() took cgroup_mutex (and released it afterward).
That code was introduced in commit c7ba5c9e.

commit c7ba5c9e8176704bfac0729875fa62798037584d
Author: Pavel Emelianov <xemul@openvz.org>
Date:   Thu Feb 7 00:13:58 2008 -0800

    Memory controller: OOM handling

    Out of memory handling for cgroups over their limit. A task from the
    cgroup over limit is chosen using the existing OOM logic and killed.

    TODO:
    1. As discussed in the OLS BOF session, consider implementing a user
    space policy for OOM handling.

    [akpm@linux-foundation.org: fix build due to oom-killer changes]
    Signed-off-by: Pavel Emelianov <xemul@openvz.org>
    Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: Paul Menage <menage@google.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

I'm not sure what the original cgroup_lock() here was intended for, but I imagine
it was meant to prevent tasks from being moved during select_bad_process().

If there were no such lock:

  Assume cgroup foo has exceeded its limit and is about to trigger OOM.
  1. Process A, which has been in cgroup baa and uses a lot of memory,
     has just been moved to cgroup foo. Process A can become a candidate for being killed.
  2. Process B, which has been in cgroup foo and uses a lot of memory,
     has just been moved out of cgroup foo. Process B can be excluded from the
     candidates for being killed.
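
The two cases above can be modeled in user space. This is only a rough sketch:
struct task, the CG_* constants and pick_victim() are hypothetical names made up
for illustration, and pick_victim() is a simplified stand-in for
select_bad_process() that just picks the largest task in the cgroup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum { CG_FOO = 0, CG_BAA = 1 };

struct task {
	const char *name;
	int cgroup;		/* which cgroup the task currently belongs to */
	unsigned long rss;	/* memory usage, arbitrary units */
};

/*
 * Simplified stand-in for select_bad_process(): return the task with the
 * largest rss among those that are in @cgroup at the time of the scan.
 */
static const struct task *pick_victim(const struct task *tasks, size_t n,
				      int cgroup)
{
	const struct task *victim = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (tasks[i].cgroup != cgroup)
			continue;	/* like task_in_mem_cgroup() returning false */
		if (victim == NULL || tasks[i].rss > victim->rss)
			victim = &tasks[i];
	}
	return victim;
}
```

With A (large, in baa), B (large, in foo) and C (small, in foo), the scan of foo
picks B. If A is moved into foo before the scan, A is picked instead (case 1); if
B is moved out before the scan, B is never considered (case 2). The result depends
only on where each task is at the moment of the scan.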

Hmm, but thinking about it more, those race windows exist anyway even if we hold a lock,
because try_charge decides whether to trigger OOM outside of the lock.

If this recharge feature is enabled, I think those problems could be avoided by doing something like:

__mem_cgroup_try_charge()
{
	...
	if (oom) {
		mutex_lock(&memcg_tasklist);
		/* re-check under the lock: usage may be back under the limit */
		if (unlikely(mem_cgroup_check_under_limit())) {
			mutex_unlock(&memcg_tasklist);
			continue;	/* no OOM kill needed, retry the charge */
		}
		mem_cgroup_out_of_memory();
		mutex_unlock(&memcg_tasklist);
		record_last_oom();
	}
	...
}

but it makes the code more complex, and the recharge feature isn't necessarily enabled.
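
For reference, the "re-check under the lock" idea can be sketched in user space with
pthreads. This is a minimal sketch, not kernel code: struct group, under_limit() and
oom_if_still_needed() are hypothetical names, the mutex only plays the role of
memcg_tasklist, and the "kill" is reduced to a counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's data structures. */
struct group {
	pthread_mutex_t tasklist;	/* plays the role of memcg_tasklist */
	unsigned long usage;
	unsigned long limit;
	unsigned int kills;		/* how many times the "OOM killer" ran */
};

/* Assumed semantics: the group is under its limit when usage < limit. */
static bool under_limit(const struct group *g)
{
	return g->usage < g->limit;
}

/*
 * The caller has already decided, outside the lock, that OOM handling is
 * needed. Re-check that decision with the lock held, so that a change made
 * by a concurrent lock holder is not missed.
 *
 * Returns true if a kill was performed, false if the caller should simply
 * retry the charge.
 */
static bool oom_if_still_needed(struct group *g)
{
	bool killed = false;

	assert(g != NULL);

	pthread_mutex_lock(&g->tasklist);
	if (!under_limit(g)) {
		g->kills++;		/* stand-in for the actual OOM kill */
		killed = true;
	}
	pthread_mutex_unlock(&g->tasklist);

	return killed;
}
```

The re-check only helps if whatever lowers the usage (here, a task move with
recharge) also runs under the same mutex. Otherwise the window is merely made
smaller, not closed.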

Well, I personally think we can remove these locks completely and make the code simpler.
What do you think?


Thanks,
Daisuke Nishimura.
