From: Ben Blum
Subject: Re: [PATCH v5 3/3] cgroups: make procs file writable
Date: Mon, 27 Dec 2010 02:21:23 -0500
Message-ID: <20101227072123.GA19652@ghc17.ghc.andrew.cmu.edu>
References: <20100811054851.GD8743@ghc17.ghc.andrew.cmu.edu>
 <20101216002603.6741874a.akpm@linux-foundation.org>
 <20101224033352.GA7804@ghc17.ghc.andrew.cmu.edu>
 <20101225025508.GA649@ghc17.ghc.andrew.cmu.edu>
 <20101227095353.48d95687.nishimura@mxp.nes.nec.co.jp>
 <20101227042254.GA15417@ghc17.ghc.andrew.cmu.edu>
 <20101227160041.07bff52a.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20101227160041.07bff52a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
To: KAMEZAWA Hiroyuki
Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Daisuke Nishimura, oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org, Miao Xie, David Rientjes, Andrew Morton, Paul Menage
List-Id: containers.vger.kernel.org

On Mon, Dec 27, 2010 at 04:00:41PM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 26 Dec 2010 23:22:54 -0500
> Ben Blum wrote:
> 
> > On Mon, Dec 27, 2010 at 09:53:53AM +0900, Daisuke Nishimura wrote:
> > > On Fri, 24 Dec 2010 21:55:08 -0500
> > > Ben Blum wrote:
> > > 
> > > > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > > > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > > > > Patches have gone a bit stale, sorry. Refactoring in
> > > > > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > > > > 
> > > > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of
> > > > > stuff in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC
> > > > > (which wraps kmalloc) instead of on the stack.
> > > > > 
> > > > > 1. All these functions have 'void' return values, indicating that
> > > > > calling them must not fail. Sure there are bailout cases, but no
> > > > > semblance of cross-function error propagation. Most importantly,
> > > > > cpuset_attach is a subsystem callback, which MUST not fail given
> > > > > the way it's used in cgroups, so relying on kmalloc is not safe.
> > > > > 
> > > > > 2. I'm working on a patch series which needs to hold tasklist_lock
> > > > > across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> > > > > case, this is how safe traversal of tsk->thread_group will be
> > > > > ensured), and kmalloc isn't allowed while holding a spin-lock.
> > > > > 
> > > > > Why do we need heap-allocation here at all? In each case the
> > > > > nodemask's scope is exactly the function's scope, and neither the
> > > > > commit nor the surrounding patch series gives any explanation. I'd
> > > > > like to revert the patch if possible.
> > > > > 
> > > > > cc'ing Miao Xie (author) and David Rientjes (acker).
> > > > > 
> > > > > -- Ben
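
(To make point 1 concrete: a condensed sketch of the pattern in question,
not the actual kernel/cpuset.c code - the callback name is made up:

	/* NODEMASK_ALLOC(type, name, gfp) falls back to kmalloc when
	 * nodemask_t is too large for the stack, so "to" can be NULL. */
	static void some_cpuset_callback(struct task_struct *tsk)
	{
		NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);

		if (!to)
			return;	/* void return: the failure is silently
				 * swallowed, even though cgroups has
				 * already committed to the attach */
		/* ... compute the new mems_allowed in *to ... */
		NODEMASK_FREE(to);
	}

An on-stack nodemask_t has no such failure mode, which is why reverting
looks attractive.)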
> > > > 
> > > > Well even with the proposed solution to this there is still another
> > > > problem that I see - that of mmap_sem. cpuset_attach() calls into
> > > > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> > > > writing and reading respectively. This is going to conflict with
> > > > tasklist_lock... but moreover, the memcontrol subsys also touches
> > > > the task's mm->mmap_sem, holding onto it between
> > > > mem_cgroup_can_attach() and mem_cgroup_move_task() - as of
> > > > b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> > > > 
> > > > So we have (currently, even without my patches):
> > > > 
> > > > cgroup_attach_task
> > > > (1) cpuset_can_attach
> > > > (2) mem_cgroup_can_attach
> > > >     - down_read(&mm->mmap_sem);
> > > > (3) cpuset_attach
> > > >     - mpol_rebind_mm
> > > >         - down_write(&mm->mmap_sem);
> > > >         - up_write(&mm->mmap_sem);
> > > >     - cpuset_migrate_mm
> > > >         - do_migrate_pages
> > > >             - down_read(&mm->mmap_sem);
> > > >             - up_read(&mm->mmap_sem);
> > > > (4) mem_cgroup_move_task
> > > >     - mem_cgroup_clear_mc
> > > >         - up_read(...);
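
(Spelled out, the fatal part of that trace is steps (2)->(3): the same
task takes mmap_sem for read and then for write. A minimal sketch of just
the lock sequence - not real kernel code, the function is made up:

	static void attach_deadlock_sketch(struct mm_struct *mm)
	{
		down_read(&mm->mmap_sem);	/* (2) mem_cgroup_can_attach */
		down_write(&mm->mmap_sem);	/* (3) mpol_rebind_mm: never
						 * returns - the writer waits
						 * for all readers to drop the
						 * sem, and this task is still
						 * one of those readers */
	}

rwsems aren't recursive, so read-then-write on the same rwsem within one
task is a guaranteed self-deadlock whenever both subsystems are
co-mounted.)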
> > > > 
> > > hmm, nice catch.
> > > 
> > > > Is there some interdependency I'm missing here that guarantees
> > > > recursive locking/deadlock will be avoided? It all looks like
> > > > typical-case code.
> > > > 
> > > Unfortunately, not.
> > > I couldn't hit this because I mount all subsystems onto different
> > > mount points in my environment.
> > > 
> > > > I think we should move taking the mmap_sem all the way up into
> > > > cgroup_attach_task and cgroup_attach_proc; it will be held for
> > > > writing the whole time. I don't quite understand the mempolicy
> > > > stuff but maybe there can be ways to use mpol_rebind_mm and
> > > > do_migrate_pages when the lock is already held.
> > > > 
> > > I agree.
> > > Another solution (just an idea): avoid enabling both the
> > > "move_charge" feature of memcg and the "memory_migrate" feature of
> > > cpuset at the same time iff they are mounted under the same mount
> > > point. But, hmm... it's not a good idea to make a subsystem take
> > > account of another subsystem, IMHO.
> > > 
> > > Thanks,
> > > Daisuke Nishimura.
> > 
> > It looks to me like when memcg holds the mmap_sem the whole time, it's
> > just to avoid the deadlock, not that there's some need for the stuff
> > under mmap_sem not to change between can_attach and attach. But if
> > there is such a need, then the write-side in mpol_rebind_mm may
> > conflict even with my proposed solution.
> > 
> > Regardless, the best way would be to avoid holding the mmap_sem across
> > the whole window, possibly by solving the move_charge deadlock some
> > other internal way, if at all possible?
> 
> IIUC, what you request is 'don't call any kind of function that may
> sleep' in the can_attach() and attach() callbacks. That's impossible.
> Memory cgroup will go to sleep because it calls memory reclaim.
> 
> Is tasklist_lock the only way? Can you catch the "new thread" event in
> cgroup_fork()?
> 
> For example,
> 
> ==
> void cgroup_fork(struct task_struct *child)
> {
> 	if (current->in_moving) {
> 		add_to_waitqueue somewhere.
> 	}
> 	task_lock(current);
> 	child->cgroups = current->cgroups;
> 	get_css_set(child->cgroups);
> 	task_unlock(current);
> 	INIT_LIST_HEAD(&child->cg_list);
> }
> ==
> 
> ==
> read_lock(&tasklist_lock);
> list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> 	tsk->in_moving = true;
> }
> read_unlock(&tasklist_lock);
> 
> ->pre_attach()
> ->attach()
> 
> read_lock(&tasklist_lock);
> list_for_each_.....
> 	tsk->in_moving = false;
> read_unlock(&tasklist_lock);
> 
> wakeup threads in waitq.
> ==
> 
> Ah yes, this will have some bugs. But please avoid taking
> tasklist_lock() around all of pre_attach() and attach().
> 
> Thanks,
> -Kame

You misunderstand slightly: the callbacks are split into a once-per-thread
function and a thread-independent function, and only the former is not
allowed to sleep. All of memcg's attachment is thread-independent, so
sleeping there is fine.

Also, the tasklist_lock isn't used to synchronize fork events; that's what
threadgroup_fork_lock (an rwsem) is for. tasklist_lock is for protecting
the thread-group list while iterating over it - so that a race with exec()
(de_thread(), specifically) doesn't change the threadgroup state. It's
basically never safe to iterate over leader->thread_group unless you're in
a no-sleep section.

-- Ben

(P.S. Paul, do you remember why we decided to use tasklist_lock in the
middle there instead of rcu_read_lock like the other places where it
iterates over ->thread_group?)
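
(P.P.S. For anyone following along, here is the shape of the fork-side
synchronization I mean - a sketch using names from my patch series, not
final code:

	/* fork path, in copy_process: taken for reading around the window
	 * where the new thread becomes visible to cgroup attach */
	down_read(&current->signal->threadgroup_fork_lock);
	/* ... link the child into the threadgroup ... */
	up_read(&current->signal->threadgroup_fork_lock);

	/* attach path, in cgroup_attach_proc: taken for writing so no new
	 * threads can appear while every existing thread is moved */
	down_write(&leader->signal->threadgroup_fork_lock);
	/* ... ->can_attach()/->attach() over the whole threadgroup ... */
	up_write(&leader->signal->threadgroup_fork_lock);

tasklist_lock then only needs to cover the actual list walk, to keep
exec's de_thread() from rearranging the group mid-iteration.)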