From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id F289A6B0044 for ; Fri, 6 Nov 2009 00:29:16 -0500 (EST) Date: Fri, 6 Nov 2009 14:10:11 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 0/8] memcg: recharge at task move Message-Id: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: Hi. In current memcg, charges associated with a task aren't moved to the new cgroup at task move. These patches add this feature, that is, recharging to the new cgroup and, of course, uncharging from the old cgroup at task move. The current version supports only recharge of non-shared (mapcount == 1) anonymous pages and swaps of those pages. I think that is enough as a first step. [1/8] cgroup: introduce cancel_attach() [2/8] memcg: move memcg_tasklist mutex [3/8] memcg: add mem_cgroup_cancel_charge() [4/8] memcg: cleanup mem_cgroup_move_parent() [5/8] memcg: add interface to recharge at task move [6/8] memcg: recharge charges of anonymous page [7/8] memcg: avoid oom during recharge at task move [8/8] memcg: recharge charges of anonymous swap 2 depends on 1, and 4 depends on 3. 3 and 4 are just cleanups. 5-8 are the body of this feature. Major changes from Oct 13: - removed "[RFC]". - rebased on mmotm-2009-11-01-10-01. - dropped support for file cache and shmem/tmpfs (revisit in future). - updated Documentation/cgroups/memory.txt. TODO: - add support for file cache, shmem/tmpfs, and shared (mapcount > 1) pages. - implement madvise(2) to let users decide the target vma for recharge. Any comments or suggestions would be welcome. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 143B16B0044 for ; Fri, 6 Nov 2009 00:30:11 -0500 (EST) Date: Fri, 6 Nov 2009 14:11:06 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Message-Id: <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: This patch adds a cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when the can_attach() operation prepares something for the subsystem, so that the preparation can be rolled back if attaching the task fails after can_attach() has succeeded.
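[Editor's note: for illustration, a subsystem whose can_attach() has side effects would pair the two callbacks roughly as below. This is only a sketch, not code from this series; the "foo" subsystem, its mutex, and the function names are invented for the example. Patch 2/8 uses exactly this pattern for memcg's memcg_tasklist.]

static DEFINE_MUTEX(foo_attach_mutex);

/* prepare: take a lock (or precharge a resource) for the coming attach */
static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			  struct task_struct *tsk, bool threadgroup)
{
	mutex_lock(&foo_attach_mutex);
	return 0;
}

/* roll back exactly what can_attach() prepared if the attach fails */
static void foo_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *tsk, bool threadgroup)
{
	mutex_unlock(&foo_attach_mutex);
}

The attach() callback (not shown) releases the same preparation on the success path, so the two callbacks together guarantee that the side effect of can_attach() is undone exactly once.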
Signed-off-by: Daisuke Nishimura --- Documentation/cgroups/cgroups.txt | 13 +++++++++++- include/linux/cgroup.h | 2 + kernel/cgroup.c | 38 ++++++++++++++++++++++++++++++------ 3 files changed, 45 insertions(+), 8 deletions(-) diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 0b33bfe..c86947c 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation. If a NULL task is passed, then a successful result indicates that *any* unspecified task can be moved into the cgroup. Note that this isn't called on a fork. If this method returns 0 (success) then this should -remain valid while the caller holds cgroup_mutex. If threadgroup is +remain valid while the caller holds cgroup_mutex and it is ensured that either +attach() or cancel_attach() will be called in future. If threadgroup is true, then a successful result indicates that all threads in the given thread's threadgroup can be moved together. +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, + struct task_struct *task, bool threadgroup) +(cgroup_mutex held by caller) + +Called when a task attach operation has failed after can_attach() has succeeded. +A subsystem whose can_attach() has some side-effects should provide this +function, so that the subsystem can implement a rollback. Otherwise, it is not +necessary. This will be called only for subsystems whose can_attach() operation +has succeeded. + void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *task, bool threadgroup) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 0008dee..d4cc200 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -427,6 +427,8 @@ struct cgroup_subsys { void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct task_struct *tsk, bool threadgroup); + void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, + struct task_struct *tsk, bool threadgroup); void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk, bool threadgroup); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 0249f4b..e443742 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen) int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) { int retval = 0; - struct cgroup_subsys *ss; + struct cgroup_subsys *ss, *failed_ss = NULL; struct cgroup *oldcgrp; struct css_set *cg; struct css_set *newcg; @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) for_each_subsys(root, ss) { if (ss->can_attach) { retval = ss->can_attach(ss, cgrp, tsk, false); - if (retval) - return retval; + if (retval) { + /* + * Remember at which subsystem we've failed in + * can_attach() to call cancel_attach() only + * against subsystems whose can_attach() has + * succeeded (see below).
+ */ + failed_ss = ss; + goto out; + } } } @@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) */ newcg = find_css_set(cg, cgrp); put_css_set(cg); - if (!newcg) - return -ENOMEM; + if (!newcg) { + retval = -ENOMEM; + goto out; + } task_lock(tsk); if (tsk->flags & PF_EXITING) { task_unlock(tsk); put_css_set(newcg); - return -ESRCH; + retval = -ESRCH; + goto out; } rcu_assign_pointer(tsk->cgroups, newcg); task_unlock(tsk); @@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) * is no longer empty. */ cgroup_wakeup_rmdir_waiter(cgrp); - return 0; +out: + if (retval) + for_each_subsys(root, ss) { + if (ss == failed_ss) + /* + * This means can_attach() of this subsystem + * has failed, so we don't need to call + * cancel_attach() against the rest of the + * subsystems. + */ + break; + if (ss->cancel_attach) + ss->cancel_attach(ss, cgrp, tsk, false); + } + return retval; } /* -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id BF43D6B0062 for ; Fri, 6 Nov 2009 00:30:45 -0500 (EST) Date: Fri, 6 Nov 2009 14:11:49 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-Id: <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup during select_bad_process(). task_in_mem_cgroup(), which can be called by select_bad_process(), will check whether a task is in the mem_cgroup or not by dereferencing task->cgroups ->subsys[]. So, it would be desirable to change task->cgroups (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held. Now that we can define cancel_attach(), we can safely release memcg_tasklist on the fail path even though we take memcg_tasklist in can_attach(). So let's move the mutex_lock/unlock() of memcg_tasklist.
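[Editor's note: the resulting lock protocol can be summarized by the following illustrative call-order sketch; it is not code from the patch, but a condensed view of cgroup_attach_task() from patch 1/8 together with the memcg handlers below.]

/*
 * cgroup_attach_task(cgrp, tsk)
 *   mem_cgroup_can_attach()          mutex_lock(&memcg_tasklist)
 *   rcu_assign_pointer(tsk->cgroups, newcg)   <- done with the mutex held
 *   mem_cgroup_move_task()           mutex_unlock(&memcg_tasklist)  [success]
 *
 * or, if anything fails after can_attach() has succeeded:
 *
 *   mem_cgroup_cancel_attach()       mutex_unlock(&memcg_tasklist)  [failure]
 */

Either way the mutex is released exactly once; that is what the new cancel_attach() callback makes possible.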
Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 22 ++++++++++++++++++++-- 1 files changed, 20 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4bd3451..d3b2ac0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, return ret; } +static int mem_cgroup_can_attach(struct cgroup_subsys *ss, + struct cgroup *cgroup, + struct task_struct *p, + bool threadgroup) +{ + mutex_lock(&memcg_tasklist); + return 0; +} + +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, + struct cgroup *cgroup, + struct task_struct *p, + bool threadgroup) +{ + mutex_unlock(&memcg_tasklist); +} + static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *old_cont, struct task_struct *p, bool threadgroup) { - mutex_lock(&memcg_tasklist); + mutex_unlock(&memcg_tasklist); /* * FIXME: It's better to move charges of this process from old * memcg to new memcg. But it's just on TODO-List now. */ - mutex_unlock(&memcg_tasklist); } struct cgroup_subsys mem_cgroup_subsys = { @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = { .pre_destroy = mem_cgroup_pre_destroy, .destroy = mem_cgroup_destroy, .populate = mem_cgroup_populate, + .can_attach = mem_cgroup_can_attach, + .cancel_attach = mem_cgroup_cancel_attach, .attach = mem_cgroup_move_task, .early_init = 0, .use_id = 1, -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 1C0676B0062 for ; Fri, 6 Nov 2009 00:31:23 -0500 (EST) Date: Fri, 6 Nov 2009 14:12:19 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 3/8] memcg: add mem_cgroup_cancel_charge() Message-Id: <20091106141219.25f83e6f.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_try_charge(). This patch introduces mem_cgroup_cancel_charge() and calls it in those places. Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 38 ++++++++++++++++++-------------------- 1 files changed, 18 insertions(+), 20 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d3b2ac0..05e837c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1500,6 +1500,21 @@ nomem: } /* + * Sometimes we have to undo a charge we got by try_charge(). + * This function is for that and does the uncharge and puts the css refcnt + * gotten by try_charge(). + */ +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem) +{ + if (!mem_cgroup_is_root(mem)) { + res_counter_uncharge(&mem->res, PAGE_SIZE); + if (do_swap_account) + res_counter_uncharge(&mem->memsw, PAGE_SIZE); + } + css_put(&mem->css); +} + +/* * A helper function to get mem_cgroup from ID. must be called under * rcu_read_lock().
The caller must check css_is_removed() or some if * it's concern. (dropping refcnt from swap can be called against removed @@ -1565,12 +1580,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { unlock_page_cgroup(pc); - if (!mem_cgroup_is_root(mem)) { - res_counter_uncharge(&mem->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&mem->memsw, PAGE_SIZE); - } - css_put(&mem->css); + mem_cgroup_cancel_charge(mem); return; } @@ -1734,14 +1744,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc, cancel: put_page(page); uncharge: - /* drop extra refcnt by try_charge() */ - css_put(&parent->css); - /* uncharge if move fails */ - if (!mem_cgroup_is_root(parent)) { - res_counter_uncharge(&parent->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&parent->memsw, PAGE_SIZE); - } + mem_cgroup_cancel_charge(parent); return ret; } @@ -1957,12 +1960,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem) return; if (!mem) return; - if (!mem_cgroup_is_root(mem)) { - res_counter_uncharge(&mem->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&mem->memsw, PAGE_SIZE); - } - css_put(&mem->css); + mem_cgroup_cancel_charge(mem); } static void -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 1E5AF6B0044 for ; Fri, 6 Nov 2009 00:31:51 -0500 (EST) Date: Fri, 6 Nov 2009 14:13:01 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Message-Id: <20091106141301.497f2cef.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering the checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper (mem_cgroup_move_account()), which checks whether @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too.
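[Editor's note: the new wrapper follows the common kernel "checking wrapper + __locked helper" idiom. Schematically, as an illustrative sketch with generic, invented names rather than code from this patch:]

static int move_thing(struct thing *t, struct owner *from, struct owner *to)
{
	int ret = -EINVAL;

	lock_thing(t);
	/* validate under the lock ... */
	if (thing_is_used(t) && t->owner == from) {
		/* ... then call the helper, which only asserts its invariants */
		__move_thing(t, from, to);
		ret = 0;
	}
	unlock_thing(t);
	return ret;
}

Because the wrapper can simply take lock_page_cgroup() and fail with -EINVAL when the checks do not hold, the trylock variant loses its last user and can be removed.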
Signed-off-by: Daisuke Nishimura --- include/linux/page_cgroup.h | 7 +--- mm/memcontrol.c | 84 ++++++++++++++++++------------------------- 2 files changed, 37 insertions(+), 54 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 4b938d4..b0e4eb1 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \ static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \ { return test_and_clear_bit(PCG_##lname, &pc->flags); } +TESTPCGFLAG(Locked, LOCK) + /* Cache flag is set only once (at allocation) */ TESTPCGFLAG(Cache, CACHE) CLEARPCGFLAG(Cache, CACHE) @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc) bit_spin_lock(PCG_LOCK, &pc->flags); } -static inline int trylock_page_cgroup(struct page_cgroup *pc) -{ - return bit_spin_trylock(PCG_LOCK, &pc->flags); -} - static inline void unlock_page_cgroup(struct page_cgroup *pc) { bit_spin_unlock(PCG_LOCK, &pc->flags); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 05e837c..1ad3248 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, } /** - * mem_cgroup_move_account - move account of the page + * __mem_cgroup_move_account - move account of the page * @pc: page_cgroup of the page. * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * * The caller must confirm following. * - page is not on LRU (isolate_page() is useful.) - * - * returns 0 at success, - * returns -EBUSY when lock is busy or "pc" is unstable. + * - the pc is locked, used, and ->mem_cgroup points to @from. * * This function does "uncharge" from old cgroup but doesn't do "charge" to * new cgroup. It should be done by a caller. */ -static int mem_cgroup_move_account(struct page_cgroup *pc, +static void __mem_cgroup_move_account(struct page_cgroup *pc, struct mem_cgroup *from, struct mem_cgroup *to) { - struct mem_cgroup_per_zone *from_mz, *to_mz; - int nid, zid; - int ret = -EBUSY; struct page *page; int cpu; struct mem_cgroup_stat *stat; @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, VM_BUG_ON(from == to); VM_BUG_ON(PageLRU(pc->page)); - - nid = page_cgroup_nid(pc); - zid = page_cgroup_zid(pc); - from_mz = mem_cgroup_zoneinfo(from, nid, zid); - to_mz = mem_cgroup_zoneinfo(to, nid, zid); - - if (!trylock_page_cgroup(pc)) - return ret; - - if (!PageCgroupUsed(pc)) - goto out; - - if (pc->mem_cgroup != from) - goto out; + VM_BUG_ON(!PageCgroupLocked(pc)); + VM_BUG_ON(!PageCgroupUsed(pc)); + VM_BUG_ON(pc->mem_cgroup != from); if (!mem_cgroup_is_root(from)) res_counter_uncharge(&from->res, PAGE_SIZE); @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, css_get(&to->css); pc->mem_cgroup = to; mem_cgroup_charge_statistics(to, pc, true); - ret = 0; -out: - unlock_page_cgroup(pc); /* * We charges against "to" which may not have any tasks. Then, "to" * can be under rmdir(). But in current implementation, caller of * this function is just force_empty() and it's garanteed that * "to" is never removed. So, we don't check rmdir status here. 
*/ +} + +/* + * check whether the @pc is valid for moving account and call + * __mem_cgroup_move_account() + */ +static int mem_cgroup_move_account(struct page_cgroup *pc, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + int ret = -EINVAL; + lock_page_cgroup(pc); + if (PageCgroupUsed(pc) && pc->mem_cgroup == from) { + __mem_cgroup_move_account(pc, from, to); + ret = 0; + } + unlock_page_cgroup(pc); return ret; } @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc, if (!pcg) return -EINVAL; + ret = -EBUSY; + if (!get_page_unless_zero(page)) + goto out; + if (isolate_lru_page(page)) + goto put; parent = mem_cgroup_from_cont(pcg); - - ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page); if (ret || !parent) - return ret; - - if (!get_page_unless_zero(page)) { - ret = -EBUSY; - goto uncharge; - } - - ret = isolate_lru_page(page); - - if (ret) - goto cancel; + goto put_back; ret = mem_cgroup_move_account(pc, child, parent); - + if (!ret) + css_put(&parent->css); /* drop extra refcnt by try_charge() */ + else + mem_cgroup_cancel_charge(parent); /* does css_put */ +put_back: putback_lru_page(page); - if (!ret) { - put_page(page); - /* drop extra refcnt by try_charge() */ - css_put(&parent->css); - return 0; - } - -cancel: +put: put_page(page); -uncharge: - mem_cgroup_cancel_charge(parent); +out: return ret; } -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 46EB06B0062 for ; Fri, 6 Nov 2009 00:32:22 -0500 (EST) Date: Fri, 6 Nov 2009 14:14:18 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Message-Id: <20091106141418.07338b99.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: In current memcg, charges associated with a task aren't moved to the new cgroup at task move. These patches are for this feature, that is, for recharging to the new cgroup and, of course, uncharging from old cgroup at task move. This patch adds "memory.recharge_at_immigrate" file, which is a flag file to determine whether charges should be moved to the new cgroup at task move or not, and read/write handlers of the file. This patch also adds no-op handlers for this feature. These handlers will be implemented in later patches. Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++----- 1 files changed, 59 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1ad3248..afa1179 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -227,6 +227,12 @@ struct mem_cgroup { bool memsw_is_minimum; /* + * Should we recharge charges of a task when a task is moved into this + * mem_cgroup ? + */ + bool recharge_at_immigrate; + + /* * statistics. This must be placed at the end of memcg. 
*/ struct mem_cgroup_stat stat; @@ -2863,6 +2869,30 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event) return 0; } +static u64 mem_cgroup_recharge_read(struct cgroup *cgrp, + struct cftype *cft) +{ + return mem_cgroup_from_cont(cgrp)->recharge_at_immigrate; +} + +static int mem_cgroup_recharge_write(struct cgroup *cgrp, + struct cftype *cft, u64 val) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + + if (val != 0 && val != 1) + return -EINVAL; + /* + * We check this value both in can_attach() and attach(), so we need + * cgroup lock to prevent this value from being inconsistent. + */ + cgroup_lock(); + mem->recharge_at_immigrate = val; + cgroup_unlock(); + + return 0; +} + /* For read statistics */ enum { @@ -3096,6 +3126,11 @@ static struct cftype mem_cgroup_files[] = { .read_u64 = mem_cgroup_swappiness_read, .write_u64 = mem_cgroup_swappiness_write, }, + { + .name = "recharge_at_immigrate", + .read_u64 = mem_cgroup_recharge_read, + .write_u64 = mem_cgroup_recharge_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -3343,6 +3378,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) if (parent) mem->swappiness = get_swappiness(parent); atomic_set(&mem->refcnt, 1); + mem->recharge_at_immigrate = 0; return &mem->css; free_out: __mem_cgroup_free(mem); @@ -3379,13 +3415,26 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, return ret; } +/* Handlers for recharge at task move. */ +static int mem_cgroup_can_recharge(struct mem_cgroup *mem, + struct task_struct *p) +{ + return 0; +} + static int mem_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgroup, struct task_struct *p, bool threadgroup) { - mutex_lock(&memcg_tasklist); - return 0; + int ret = 0; + struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup); + + if (mem->recharge_at_immigrate && thread_group_leader(p)) + ret = mem_cgroup_can_recharge(mem, p); + if (!ret) + mutex_lock(&memcg_tasklist); + return ret; } static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, @@ -3396,17 +3445,21 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, mutex_unlock(&memcg_tasklist); } +static void mem_cgroup_recharge(void) +{ +} + static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *old_cont, struct task_struct *p, bool threadgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); + mutex_unlock(&memcg_tasklist); - /* - * FIXME: It's better to move charges of this process from old - * memcg to new memcg. But it's just on TODO-List now. - */ + if (mem->recharge_at_immigrate && thread_group_leader(p)) + mem_cgroup_recharge(); } struct cgroup_subsys mem_cgroup_subsys = { -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 119296B006A for ; Fri, 6 Nov 2009 00:32:53 -0500 (EST) Date: Fri, 6 Nov 2009 14:14:48 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Message-Id: <20091106141448.6548687a.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: This patch is the core part of this recharge-at-task-move feature. It implements functions to recharge charges of anonymous pages mapped only by the target task. Implementation: - define struct recharge_struct and a variable of it (recharge) to remember the count of pre-charges and other information. - At can_attach(), parse the page table of the task, count the number of mapped pages which are charged to the source mem_cgroup, call __mem_cgroup_try_charge() repeatedly, and count up recharge.precharge. - At attach(), parse the page table again, find a target page as we did in can_attach(), and call mem_cgroup_move_account() on the page. - Cancel all charges if recharge.precharge > 0 on failure or at the end of task move. Signed-off-by: Daisuke Nishimura --- Documentation/cgroups/memory.txt | 36 +++++- mm/memcontrol.c | 275 +++++++++++++++++++++++++++++++++++++- 2 files changed, 306 insertions(+), 5 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index b871f25..54281ff 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages). 4.2 Task migration When a task migrates from one cgroup to another, it's charge is not -carried forward. The pages allocated from the original cgroup still +carried forward by default. The pages allocated from the original cgroup still remain charged to it, the charge is dropped when the page is freed or reclaimed. +Note: You can move charges of a task along with task migration. See 8. + 4.3 Removing a cgroup A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a @@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve NOTE2: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. -8. TODO +8. Recharge at task move +
+Users can move charges associated with a task along with task move, that is, +uncharge from the old cgroup and charge to the new cgroup. +
+8.1 Interface +
+This feature is disabled by default. It can be enabled (and disabled again) by +writing to memory.recharge_at_immigrate of the destination cgroup. +
+If you want to enable it +
+# echo 1 > memory.recharge_at_immigrate +
+Note: A value of more than 1 will be supported in future. See 8.2. +
+And if you want to disable it again +
+# echo 0 > memory.recharge_at_immigrate +
+8.2 Types of charges which can be recharged +
+We recharge a charge which meets the following conditions. +
+a. It must be charged to the old cgroup. +b. A charge of an anonymous page used by the target task.
The page must be used only by the target task. + +Note: More types of pages (e.g. file cache, shmem) will be supported in future. + +9. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/mm/memcontrol.c b/mm/memcontrol.c index afa1179..f4b7116 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -21,6 +21,8 @@ #include #include #include +#include +#include #include #include #include @@ -239,6 +241,18 @@ struct mem_cgroup { }; /* + * A data structure and a variable for recharging charges at task move. + * "recharge" and its members are protected by cgroup_lock + */ +struct recharge_struct { + struct mem_cgroup *from; + struct mem_cgroup *to; + struct task_struct *target; /* the target task being moved */ + unsigned long precharge; +}; +static struct recharge_struct recharge; + +/* * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft * limit reclaim to prevent infinite loops, if they ever occur. */ @@ -1496,7 +1510,7 @@ charged: * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. * if they exceeds softlimit. */ - if (mem_cgroup_soft_limit_check(mem)) + if (page && mem_cgroup_soft_limit_check(mem)) mem_cgroup_update_tree(mem, page); done: return 0; @@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, } /* Handlers for recharge at task move. */ +/** + * is_target_pte_for_recharge - check a pte whether it is valid for recharge + * @vma: the vma the pte to be checked belongs + * @addr: the address corresponding to the pte to be checked + * @ptent: the pte to be checked + * @target: the pointer the target page will be stored(can be NULL) + * + * Returns + * 0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge. + * 1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target + * for recharge. if @target is not NULL, the page is stored in target->page + * with extra refcnt got(Callers should handle it). + * + * Called with pte lock held. + */ +/* We add a new member later. */ +union recharge_target { + struct page *page; +}; + +/* We add a new type later. */ +enum recharge_target_type { + RECHARGE_TARGET_NONE, /* not used */ + RECHARGE_TARGET_PAGE, +}; + +static int is_target_pte_for_recharge(struct vm_area_struct *vma, + unsigned long addr, pte_t ptent, union recharge_target *target) +{ + struct page *page; + struct page_cgroup *pc; + int ret = 0; + + if (!pte_present(ptent)) + return 0; + + page = vm_normal_page(vma, addr, ptent); + if (!page || !page_mapped(page)) + return 0; + /* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */ + if (!PageAnon(page)) + return 0; + /* + * TODO: We don't recharge shared(used by multiple processes) pages + * for now.
+ */ + if (page_mapcount(page) > 1) + return 0; + if (!get_page_unless_zero(page)) + return 0; + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) { + ret = RECHARGE_TARGET_PAGE; + if (target) + target->page = page; + } + unlock_page_cgroup(pc); + + if (!ret || !target) + put_page(page); + + return ret; +} + +static int mem_cgroup_recharge_do_precharge(void) +{ + int ret = -ENOMEM; + struct mem_cgroup *mem = recharge.to; + + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL); + if (ret || !mem) + return -ENOMEM; + + recharge.precharge++; + return ret; +} + +static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + int ret = 0; + unsigned long count = 0; + struct vm_area_struct *vma = walk->private; + pte_t *pte; + spinlock_t *ptl; + + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (; addr != end; pte++, addr += PAGE_SIZE) + if (is_target_pte_for_recharge(vma, addr, *pte, NULL)) + count++; + pte_unmap_unlock(pte - 1, ptl); + + while (count-- && !ret) + ret = mem_cgroup_recharge_do_precharge(); + + return ret; +} + +static int mem_cgroup_recharge_prepare(void) +{ + int ret = 0; + struct mm_struct *mm; + struct vm_area_struct *vma; + + mm = get_task_mm(recharge.target); + if (!mm) + return 0; + + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + struct mm_walk mem_cgroup_recharge_prepare_walk = { + .pmd_entry = mem_cgroup_recharge_prepare_pte_range, + .mm = mm, + .private = vma, + }; + if (signal_pending(current)) { + ret = -EINTR; + break; + } + if (is_vm_hugetlb_page(vma)) + continue; + ret = walk_page_range(vma->vm_start, vma->vm_end, + &mem_cgroup_recharge_prepare_walk); + if (ret) + break; + cond_resched(); + } + up_read(&mm->mmap_sem); + + mmput(mm); + return ret; +} + +static void mem_cgroup_clear_recharge(void) +{ + while (recharge.precharge--) + mem_cgroup_cancel_charge(recharge.to); + recharge.from = NULL; + recharge.to = NULL; + recharge.target = NULL; +} + static int mem_cgroup_can_recharge(struct mem_cgroup *mem, struct task_struct *p) { - return 0; + int ret; + struct mem_cgroup *from = mem_cgroup_from_task(p); + + if (from == mem) + return 0; + + recharge.from = from; + recharge.to = mem; + recharge.target = p; + recharge.precharge = 0; + + ret = mem_cgroup_recharge_prepare(); + + if (ret) + mem_cgroup_clear_recharge(); + return ret; } static int mem_cgroup_can_attach(struct cgroup_subsys *ss, @@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, struct task_struct *p, bool threadgroup) { + struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup); + mutex_unlock(&memcg_tasklist); + if (mem->recharge_at_immigrate && thread_group_leader(p)) + mem_cgroup_clear_recharge(); +} + +static int mem_cgroup_recharge_pte_range(pmd_t *pmd, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + int ret = 0; + struct vm_area_struct *vma = walk->private; + pte_t *pte; + spinlock_t *ptl; + +retry: + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (; addr != end; addr += PAGE_SIZE) { + pte_t ptent = *(pte++); + union recharge_target target; + int type; + struct page *page; + struct page_cgroup *pc; + + if (!recharge.precharge) + break; + + type = is_target_pte_for_recharge(vma, addr, ptent, &target); + switch (type) { + case RECHARGE_TARGET_PAGE: + page = target.page; + if (isolate_lru_page(page)) + goto put; + pc = 
lookup_page_cgroup(page); + if (!mem_cgroup_move_account(pc, + recharge.from, recharge.to)) { + css_put(&recharge.to->css); + recharge.precharge--; + } + putback_lru_page(page); +put: /* is_target_pte_for_recharge() gets the page */ + put_page(page); + break; + default: + continue; + } + } + pte_unmap_unlock(pte - 1, ptl); + + if (addr != end) { + /* + * We have consumed all precharges we got in can_attach(). + * We try precharge one by one, but don't do any additional + * precharges nor recharges to recharge.to if we have failed in + * precharge once in attach() phase. + */ + ret = mem_cgroup_recharge_do_precharge(); + if (!ret) + goto retry; + } + + return ret; } static void mem_cgroup_recharge(void) { + struct mm_struct *mm; + struct vm_area_struct *vma; + + mm = get_task_mm(recharge.target); + if (!mm) + return; + + lru_add_drain_all(); + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + int ret; + struct mm_walk mem_cgroup_recharge_walk = { + .pmd_entry = mem_cgroup_recharge_pte_range, + .mm = mm, + .private = vma, + }; + if (is_vm_hugetlb_page(vma)) + continue; + ret = walk_page_range(vma->vm_start, vma->vm_end, + &mem_cgroup_recharge_walk); + if (ret) + /* + * means we have consumed all precharges and failed in + * doing additional precharge. Just abandon here. + */ + break; + cond_resched(); + } + up_read(&mm->mmap_sem); + + mmput(mm); } static void mem_cgroup_move_task(struct cgroup_subsys *ss, @@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct mem_cgroup *mem = mem_cgroup_from_cont(cont); mutex_unlock(&memcg_tasklist); - if (mem->recharge_at_immigrate && thread_group_leader(p)) + if (mem->recharge_at_immigrate && thread_group_leader(p)) { mem_cgroup_recharge(); + mem_cgroup_clear_recharge(); + } } struct cgroup_subsys mem_cgroup_subsys = { -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 4FA5B6B0044 for ; Fri, 6 Nov 2009 00:33:28 -0500 (EST) Date: Fri, 6 Nov 2009 14:15:32 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Message-Id: <20091106141532.a2fe1187.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: This recharge-at-task-move feature has extra charges(pre-charges) on "to" mem_cgroup during recharging. This means unnecessary oom can happen. This patch tries to avoid such oom. 
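[Editor's note: the check added in the diff below can be summarized by this illustrative helper. It is a simplified sketch, not the patch itself: it omits the rcu_read_lock()/css_tryget() dance the real code needs to guard against a racing rmdir, and the helper name is invented.]

static bool charge_raced_with_recharge(struct mem_cgroup *mem_over_limit)
{
	struct mem_cgroup *dest = recharge.to;

	/* only tasks other than the one doing the recharge should retry */
	if (!recharge.working || current == recharge.working)
		return false;
	/* with hierarchy, dest's precharge is also charged to its ancestors */
	if (dest->use_hierarchy)
		return css_is_ancestor(&dest->css, &mem_over_limit->css);
	return dest == mem_over_limit;
}

When this condition holds, __mem_cgroup_try_charge() keeps retrying instead of invoking the OOM killer, because the pre-charges held by the in-flight recharge will soon be either consumed by the move or cancelled.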
Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 27 +++++++++++++++++++++++++++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f4b7116..7e96f3b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -248,6 +248,7 @@ struct recharge_struct { struct mem_cgroup *from; struct mem_cgroup *to; struct task_struct *target; /* the target task being moved */ + struct task_struct *working; /* a task moving the target task */ unsigned long precharge; }; static struct recharge_struct recharge; @@ -1493,6 +1494,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, if (mem_cgroup_check_under_limit(mem_over_limit)) continue; + /* try to avoid oom while someone is recharging */ + if (recharge.working && current != recharge.working) { + struct mem_cgroup *dest; + bool do_continue = false; + /* + * There is a small race that "dest" can be freed by + * rmdir, so we use css_tryget(). + */ + rcu_read_lock(); + dest = recharge.to; + if (dest && css_tryget(&dest->css)) { + if (dest->use_hierarchy) + do_continue = css_is_ancestor( + &dest->css, + &mem_over_limit->css); + else + do_continue = (dest == mem_over_limit); + css_put(&dest->css); + } + rcu_read_unlock(); + if (do_continue) + continue; + } + if (!nr_retries--) { if (oom) { mutex_lock(&memcg_tasklist); @@ -3573,6 +3598,7 @@ static void mem_cgroup_clear_recharge(void) recharge.from = NULL; recharge.to = NULL; recharge.target = NULL; + recharge.working = NULL; } static int mem_cgroup_can_recharge(struct mem_cgroup *mem, @@ -3587,6 +3613,7 @@ static int mem_cgroup_can_recharge(struct mem_cgroup *mem, recharge.from = from; recharge.to = mem; recharge.target = p; + recharge.working = current; recharge.precharge = 0; ret = mem_cgroup_recharge_prepare(); -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 45F5D6B0062 for ; Fri, 6 Nov 2009 00:33:56 -0500 (EST) Date: Fri, 6 Nov 2009 14:16:04 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 8/8] memcg: recharge charges of anonymous swap Message-Id: <20091106141604.ff80e40f.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: linux-mm Cc: Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: This patch is another core part of this recharge-at-task-move feature. It enables recharge of anonymous swaps. To move the charge of swap, we need to exchange swap_cgroup's record. In current implementation, swap_cgroup's record is protected by: - page lock: if the entry is on swap cache. - swap_lock: if the entry is not on swap cache. This works well in usual swap-in/out activity. But this behavior makes charge migration of swap check many conditions to exchange swap_cgroup's record safely. So I changed modification of swap_cgroup's record (swap_cgroup_record()) to use xchg, and defined a new function to cmpxchg swap_cgroup's record.
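[Editor's note: sketched usage of the new cmpxchg primitive, as a simplified, illustrative view of mem_cgroup_move_swap_account() in the diff below; counter fix-ups and error handling are omitted.]

	/* illustrative fragment: try to move one swap record from -> to */
	unsigned short old_id = css_id(&from->css);
	unsigned short new_id = css_id(&to->css);

	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
		/*
		 * Won the race: the record now points at "to"; fix up the
		 * memsw res_counters and css/memcg reference counts.
		 */
	} else {
		/* a concurrent swap-in/free changed the record; skip entry */
	}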
This patch also enables recharge of non pte_present but not uncharged swap caches, which can exist on the swap-out path, by getting the target pages via find_get_page() as do_mincore() does. Signed-off-by: Daisuke Nishimura --- Documentation/cgroups/memory.txt | 5 +- include/linux/page_cgroup.h | 2 + include/linux/swap.h | 1 + mm/memcontrol.c | 140 ++++++++++++++++++++++++++++++-------- mm/page_cgroup.c | 35 +++++++++- mm/swapfile.c | 32 +++++++++ 6 files changed, 182 insertions(+), 33 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 54281ff..2820626 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -441,8 +441,9 @@ And if you want to disable it again We recharge a charge which meets the following conditions. a. It must be charged to the old cgroup. -b. A charge of an anonymous page used by the target task. The page must be used - only by the target task. +b. A charge of an anonymous page (or swap of it) used by the target task. + The page (or swap) must be used only by the target task. You must enable + Swap Extension (see 2.4) to enable recharge of swap. Note: More types of pages (e.g. file cache, shmem) will be supported in future. diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index b0e4eb1..30b0813 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void) #include #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP +extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, + unsigned short old, unsigned short new); extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); extern unsigned short lookup_swap_cgroup(swp_entry_t ent); extern int swap_cgroup_swapon(int type, unsigned long max_pages); diff --git a/include/linux/swap.h b/include/linux/swap.h index 9f0ca32..2a3209e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -355,6 +355,7 @@ static inline void disable_swap_token(void) #ifdef CONFIG_CGROUP_MEM_RES_CTLR extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); +extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep); #else static inline void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7e96f3b..50e28df 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -2231,6 +2232,49 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent) } rcu_read_unlock(); } + +/** + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record. + * @entry: swap entry to be moved + * @from: mem_cgroup which the entry is moved from + * @to: mem_cgroup which the entry is moved to + * + * It succeeds only when the swap_cgroup's record for this entry is the same + * as the mem_cgroup's id of @from. + * + * Returns 0 on success, 1 on failure. + * + * The caller must have called __mem_cgroup_try_charge on @to.
+ */ +static int mem_cgroup_move_swap_account(swp_entry_t entry, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + unsigned short old_id, new_id; + + old_id = css_id(&from->css); + new_id = css_id(&to->css); + + if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) { + if (!mem_cgroup_is_root(from)) + res_counter_uncharge(&from->memsw, PAGE_SIZE); + mem_cgroup_swap_statistics(from, false); + mem_cgroup_put(from); + + if (!mem_cgroup_is_root(to)) + res_counter_uncharge(&to->res, PAGE_SIZE); + mem_cgroup_swap_statistics(to, true); + mem_cgroup_get(to); + + return 0; + } + return 1; +} +#else +static inline int mem_cgroup_move_swap_account(swp_entry_t entry, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + return 1; +} #endif /* @@ -3460,63 +3504,92 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, * @vma: the vma the pte to be checked belongs * @addr: the address corresponding to the pte to be checked * @ptent: the pte to be checked - * @target: the pointer the target page will be stored(can be NULL) + * @target: the pointer the target page or entry will be stored(can be NULL) * * Returns * 0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge. * 1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target * for recharge. if @target is not NULL, the page is stored in target->page * with extra refcnt got(Callers should handle it). + * 2(RECHARGE_TARGET_SWAP): if the swap entry corresponding to this pte is a + * target for recharge. if @target is not NULL, the entry is stored + * in target->ent. * * Called with pte lock held. */ -/* We add a new member later. */ union recharge_target { struct page *page; + swp_entry_t ent; }; -/* We add a new type later. */ enum recharge_target_type { RECHARGE_TARGET_NONE, /* not used */ RECHARGE_TARGET_PAGE, + RECHARGE_TARGET_SWAP, }; static int is_target_pte_for_recharge(struct vm_area_struct *vma, unsigned long addr, pte_t ptent, union recharge_target *target) { - struct page *page; + struct page *page = NULL; struct page_cgroup *pc; + swp_entry_t ent = { .val = 0 }; int ret = 0; + int user = 0; - if (!pte_present(ptent)) - return 0; - - page = vm_normal_page(vma, addr, ptent); - if (!page || !page_mapped(page)) - return 0; - /* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */ - if (!PageAnon(page)) - return 0; - /* - * TODO: We don't recharge shared(used by multiple processes) pages - * for now. - */ - if (page_mapcount(page) > 1) - return 0; - if (!get_page_unless_zero(page)) + if (!pte_present(ptent)) { + /* TODO: handle swap of shmem/tmpfs */ + if (pte_none(ptent) || pte_file(ptent)) + return 0; + else if (is_swap_pte(ptent)) { + ent = pte_to_swp_entry(ptent); + if (non_swap_entry(ent)) + return 0; + user = mem_cgroup_count_swap_user(ent, &page); + } + } else { + page = vm_normal_page(vma, addr, ptent); + if (!page || !page_mapped(page)) + return 0; + /* + * TODO: We don't recharge file(including shmem/tmpfs) pages + * for now. + */ + if (!PageAnon(page)) + return 0; + if (!get_page_unless_zero(page)) + return 0; + user = page_mapcount(page); + } + if (user > 1) { + /* + * TODO: We don't recharge shared(used by multiple processes) + * pages for now.
+ */ + if (page) + put_page(page); return 0; + } - pc = lookup_page_cgroup(page); - lock_page_cgroup(pc); - if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) { - ret = RECHARGE_TARGET_PAGE; + if (page) { + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) { + ret = RECHARGE_TARGET_PAGE; + if (target) + target->page = page; + } + unlock_page_cgroup(pc); + if (!ret || !target) + put_page(page); + } + /* fall through */ + if (ent.val && do_swap_account && !ret && + css_id(&recharge.from->css) == lookup_swap_cgroup(ent)) { + ret = RECHARGE_TARGET_SWAP; if (target) - target->page = page; + target->ent = ent; } - unlock_page_cgroup(pc); - - if (!ret || !target) - put_page(page); return ret; } @@ -3667,6 +3740,7 @@ retry: int type; struct page *page; struct page_cgroup *pc; + swp_entry_t ent; if (!recharge.precharge) break; @@ -3687,6 +3761,14 @@ retry: put: /* is_target_pte_for_recharge() gets the page */ put_page(page); break; + case RECHARGE_TARGET_SWAP: + ent = target.ent; + if (!mem_cgroup_move_swap_account(ent, + recharge.from, recharge.to)) { + css_put(&recharge.to->css); + recharge.precharge--; + } + break; default: continue; } diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c index 3d535d5..213b0ee 100644 --- a/mm/page_cgroup.c +++ b/mm/page_cgroup.c @@ -9,6 +9,7 @@ #include #include #include +#include static void __meminit __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn) @@ -335,6 +336,37 @@ not_enough_page: } /** + * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry. + * @ent: swap entry to be cmpxchged + * @old: old id + * @new: new id + * + * Returns old id on success, 0 on failure. + * (There is no mem_cgroup using 0 as its id) + */ +unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, + unsigned short old, unsigned short new) +{ + int type = swp_type(ent); + unsigned long offset = swp_offset(ent); + unsigned long idx = offset / SC_PER_PAGE; + unsigned long pos = offset & SC_POS_MASK; + struct swap_cgroup_ctrl *ctrl; + struct page *mappage; + struct swap_cgroup *sc; + + ctrl = &swap_cgroup_ctrl[type]; + + mappage = ctrl->map[idx]; + sc = page_address(mappage); + sc += pos; + if (cmpxchg(&sc->id, old, new) == old) + return old; + else + return 0; +} + +/** * swap_cgroup_record - record mem_cgroup for this swp_entry. + * @ent: swap entry to be recorded into + * @mem: mem_cgroup to be recorded @@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) mappage = ctrl->map[idx]; sc = page_address(mappage); sc += pos; - old = sc->id; - sc->id = id; + old = xchg(&sc->id, id); return old; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 93e71cf..28eef54 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -719,6 +719,38 @@ int free_swap_and_cache(swp_entry_t entry) return p != NULL; } +#ifdef CONFIG_CGROUP_MEM_RES_CTLR +/** + * mem_cgroup_count_swap_user - count the users of a swap entry + * @ent: the swap entry to be checked + * @pagep: the pointer for the swap cache page of the entry to be stored + * + * Returns the number of users of the swap entry. + * If the entry is found on swap cache, the page is stored to @pagep with + * its refcount incremented. + * + * This function can be used only for swap entries of anonymous pages.
+ */ +int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep) +{ + struct page *page; + struct swap_info_struct *p; + int count = 0; + + page = find_get_page(&swapper_space, ent.val); + if (page) + count += page_mapcount(page); + p = swap_info_get(ent); + if (p) { + count += swap_count(p->swap_map[swp_offset(ent)]); + spin_unlock(&swap_lock); + } + + *pagep = page; + return count; +} +#endif + #ifdef CONFIG_HIBERNATION /* * Find the swap type that corresponds to given device (if any). -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 6611D6B0044 for ; Fri, 6 Nov 2009 00:57:41 -0500 (EST) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA65vcBk030781 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 14:57:38 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 866AE45DE7B for ; Fri, 6 Nov 2009 14:57:37 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 4D09045DE4D for ; Fri, 6 Nov 2009 14:57:37 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id B830AE1800F for ; Fri, 6 Nov 2009 14:57:36 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id DB1FE1DB803F for ; Fri, 6 Nov 2009 14:57:31 +0900 (JST) Date: Fri, 6 Nov 2009 14:54:59 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-Id: <20091106145459.351b407f.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:11:49 +0900 Daisuke Nishimura wrote: > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup > during select_bad_process(). > > task_in_mem_cgroup(), which can be called by select_bad_process(), will check > whether a task is in the mem_cgroup or not by dereferencing task->cgroups > ->subsys[]. So, it would be desirable to change task->cgroups > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held. > > Now that we can define cancel_attach(), we can safely release memcg_tasklist > on fail path even if we hold memcg_tasklist in can_attach(). So let's move > mutex_lock/unlock() of memcg_tasklist. 
> > Signed-off-by: Daisuke Nishimura > --- > mm/memcontrol.c | 22 ++++++++++++++++++++-- > 1 files changed, 20 insertions(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 4bd3451..d3b2ac0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > return ret; > } > > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > + struct cgroup *cgroup, > + struct task_struct *p, > + bool threadgroup) > +{ > + mutex_lock(&memcg_tasklist); > + return 0; > +} Hmm...Is this lock really necessary ? IOW, can't we just remove memcg_tasklist mutex ? What kind of bad race happens when we remove this ? Thanks, -Kame > + > +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, > + struct cgroup *cgroup, > + struct task_struct *p, > + bool threadgroup) > +{ > + mutex_unlock(&memcg_tasklist); > +} > + > static void mem_cgroup_move_task(struct cgroup_subsys *ss, > struct cgroup *cont, > struct cgroup *old_cont, > struct task_struct *p, > bool threadgroup) > { > - mutex_lock(&memcg_tasklist); > + mutex_unlock(&memcg_tasklist); > /* > * FIXME: It's better to move charges of this process from old > * memcg to new memcg. But it's just on TODO-List now. > */ > - mutex_unlock(&memcg_tasklist); > } > > struct cgroup_subsys mem_cgroup_subsys = { > @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = { > .pre_destroy = mem_cgroup_pre_destroy, > .destroy = mem_cgroup_destroy, > .populate = mem_cgroup_populate, > + .can_attach = mem_cgroup_can_attach, > + .cancel_attach = mem_cgroup_cancel_attach, > .attach = mem_cgroup_move_task, > .early_init = 0, > .use_id = 1, Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3E60E6B0044 for ; Fri, 6 Nov 2009 00:59:06 -0500 (EST) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA65x3bb004289 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 14:59:03 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 0968F45DE6F for ; Fri, 6 Nov 2009 14:59:03 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id DE55A45DE4D for ; Fri, 6 Nov 2009 14:59:02 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id CAD2D1DB803B for ; Fri, 6 Nov 2009 14:59:02 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 8094BE18001 for ; Fri, 6 Nov 2009 14:59:02 +0900 (JST) Date: Fri, 6 Nov 2009 14:56:29 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Message-Id: <20091106145629.58b810e3.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141301.497f2cef.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141301.497f2cef.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:13:01 +0900 Daisuke Nishimura wrote: > mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. > IMHO, charge/uncharge(especially charge) is high cost operation, so we should > avoid it as far as possible. > > This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering > checks it does. > > And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), > changes the return value of __mem_cgroup_move_account() from int to void, > and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc > is valid for moving account and calls __mem_cgroup_move_account(). > > This patch removes the last caller of trylock_page_cgroup(), so removes its > definition too. 
> > Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki > --- > include/linux/page_cgroup.h | 7 +--- > mm/memcontrol.c | 84 ++++++++++++++++++------------------------- > 2 files changed, 37 insertions(+), 54 deletions(-) > > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h > index 4b938d4..b0e4eb1 100644 > --- a/include/linux/page_cgroup.h > +++ b/include/linux/page_cgroup.h > @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \ > static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \ > { return test_and_clear_bit(PCG_##lname, &pc->flags); } > > +TESTPCGFLAG(Locked, LOCK) > + > /* Cache flag is set only once (at allocation) */ > TESTPCGFLAG(Cache, CACHE) > CLEARPCGFLAG(Cache, CACHE) > @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc) > bit_spin_lock(PCG_LOCK, &pc->flags); > } > > -static inline int trylock_page_cgroup(struct page_cgroup *pc) > -{ > - return bit_spin_trylock(PCG_LOCK, &pc->flags); > -} > - > static inline void unlock_page_cgroup(struct page_cgroup *pc) > { > bit_spin_unlock(PCG_LOCK, &pc->flags); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 05e837c..1ad3248 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, > } > > /** > - * mem_cgroup_move_account - move account of the page > + * __mem_cgroup_move_account - move account of the page > * @pc: page_cgroup of the page. > * @from: mem_cgroup which the page is moved from. > * @to: mem_cgroup which the page is moved to. @from != @to. > * > * The caller must confirm following. > * - page is not on LRU (isolate_page() is useful.) > - * > - * returns 0 at success, > - * returns -EBUSY when lock is busy or "pc" is unstable. > + * - the pc is locked, used, and ->mem_cgroup points to @from. > * > * This function does "uncharge" from old cgroup but doesn't do "charge" to > * new cgroup. It should be done by a caller. > */ > > -static int mem_cgroup_move_account(struct page_cgroup *pc, > +static void __mem_cgroup_move_account(struct page_cgroup *pc, > struct mem_cgroup *from, struct mem_cgroup *to) > { > - struct mem_cgroup_per_zone *from_mz, *to_mz; > - int nid, zid; > - int ret = -EBUSY; > struct page *page; > int cpu; > struct mem_cgroup_stat *stat; > @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, > > VM_BUG_ON(from == to); > VM_BUG_ON(PageLRU(pc->page)); > - > - nid = page_cgroup_nid(pc); > - zid = page_cgroup_zid(pc); > - from_mz = mem_cgroup_zoneinfo(from, nid, zid); > - to_mz = mem_cgroup_zoneinfo(to, nid, zid); > - > - if (!trylock_page_cgroup(pc)) > - return ret; > - > - if (!PageCgroupUsed(pc)) > - goto out; > - > - if (pc->mem_cgroup != from) > - goto out; > + VM_BUG_ON(!PageCgroupLocked(pc)); > + VM_BUG_ON(!PageCgroupUsed(pc)); > + VM_BUG_ON(pc->mem_cgroup != from); > > if (!mem_cgroup_is_root(from)) > res_counter_uncharge(&from->res, PAGE_SIZE); > @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, > css_get(&to->css); > pc->mem_cgroup = to; > mem_cgroup_charge_statistics(to, pc, true); > - ret = 0; > -out: > - unlock_page_cgroup(pc); > /* > * We charges against "to" which may not have any tasks. Then, "to" > * can be under rmdir(). But in current implementation, caller of > * this function is just force_empty() and it's garanteed that > * "to" is never removed. So, we don't check rmdir status here. 
> */ > +} > + > +/* > + * check whether the @pc is valid for moving account and call > + * __mem_cgroup_move_account() > + */ > +static int mem_cgroup_move_account(struct page_cgroup *pc, > + struct mem_cgroup *from, struct mem_cgroup *to) > +{ > + int ret = -EINVAL; > + lock_page_cgroup(pc); > + if (PageCgroupUsed(pc) && pc->mem_cgroup == from) { > + __mem_cgroup_move_account(pc, from, to); > + ret = 0; > + } > + unlock_page_cgroup(pc); > return ret; > } > > @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc, > if (!pcg) > return -EINVAL; > > + ret = -EBUSY; > + if (!get_page_unless_zero(page)) > + goto out; > + if (isolate_lru_page(page)) > + goto put; > > parent = mem_cgroup_from_cont(pcg); > - > - > ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page); > if (ret || !parent) > - return ret; > - > - if (!get_page_unless_zero(page)) { > - ret = -EBUSY; > - goto uncharge; > - } > - > - ret = isolate_lru_page(page); > - > - if (ret) > - goto cancel; > + goto put_back; > > ret = mem_cgroup_move_account(pc, child, parent); > - > + if (!ret) > + css_put(&parent->css); /* drop extra refcnt by try_charge() */ > + else > + mem_cgroup_cancel_charge(parent); /* does css_put */ > +put_back: > putback_lru_page(page); > - if (!ret) { > - put_page(page); > - /* drop extra refcnt by try_charge() */ > - css_put(&parent->css); > - return 0; > - } > - > -cancel: > +put: > put_page(page); > -uncharge: > - mem_cgroup_cancel_charge(parent); > +out: > return ret; > } > > -- > 1.5.6.1 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 9B35E6B0044 for ; Fri, 6 Nov 2009 01:09:15 -0500 (EST) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA669CXS007312 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 15:09:13 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 94D7645DE4F for ; Fri, 6 Nov 2009 15:09:12 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 753F045DE4E for ; Fri, 6 Nov 2009 15:09:12 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 605F41DB8040 for ; Fri, 6 Nov 2009 15:09:12 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 0FAB91DB803F for ; Fri, 6 Nov 2009 15:09:12 +0900 (JST) Date: Fri, 6 Nov 2009 15:06:40 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Message-Id: <20091106150640.52c92ce8.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141418.07338b99.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141418.07338b99.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:14:18 +0900 Daisuke Nishimura wrote: > In current memcg, charges associated with a task aren't moved to the new cgroup > at task move. These patches are for this feature, that is, for recharging to > the new cgroup and, of course, uncharging from old cgroup at task move. > > This patch adds "memory.recharge_at_immigrate" file, which is a flag file to > determine whether charges should be moved to the new cgroup at task move or > not, and read/write handlers of the file. > This patch also adds no-op handlers for this feature. These handlers will be > implemented in later patches. > > Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki BTW, even if this patch is just for the interface, adding Documentation here is a good choice, I think. It makes things clear. > + > + if (mem->recharge_at_immigrate && thread_group_leader(p)) > + ret = mem_cgroup_can_recharge(mem, p); My small concern is whether thread_group_leader(p) is _always_ equal to mm->owner. If not, things will be complicated. Could you clarify ? If mm->owner is better, we should use mm->owner here. (An unusual case, but..) Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 4BA796B0044 for ; Fri, 6 Nov 2009 01:38:07 -0500 (EST) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA66c47u020806 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 15:38:04 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id CB61845DE50 for ; Fri, 6 Nov 2009 15:38:03 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id A6D7645DE4F for ; Fri, 6 Nov 2009 15:38:03 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 52DF61DB803B for ; Fri, 6 Nov 2009 15:38:03 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id F34EC1DB803F for ; Fri, 6 Nov 2009 15:37:59 +0900 (JST) Date: Fri, 6 Nov 2009 15:35:26 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Message-Id: <20091106153526.19b70518.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141448.6548687a.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141448.6548687a.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:14:48 +0900 Daisuke Nishimura wrote: > This patch is the core part of this recharge-at-task-move feature. > It implements functions to recharge charges of anonymous pages mapped only by > the target task. > > Implementation: > - define struct recharge_struct and a valuable of it(recharge) to remember > the count of pre-charges and other information. > - At can_attach(), parse the page table of the task and count the number of > mapped pages which are charged to the source mem_cgroup, and call > __mem_cgroup_try_charge() repeatedly and count up recharge.precharge. > - At attach(), parse the page table again, find a target page as we did in > can_attach(), and call mem_cgroup_move_account() about the page. > - Cancel all charges if recharge.precharge > 0 on failure or at the end of > task move. > Changelog ? > Signed-off-by: Daisuke Nishimura > --- > Documentation/cgroups/memory.txt | 36 +++++- > mm/memcontrol.c | 275 +++++++++++++++++++++++++++++++++++++- > 2 files changed, 306 insertions(+), 5 deletions(-) > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index b871f25..54281ff 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages). > 4.2 Task migration > > When a task migrates from one cgroup to another, it's charge is not > -carried forward. The pages allocated from the original cgroup still > +carried forward by default. The pages allocated from the original cgroup still > remain charged to it, the charge is dropped when the page is freed or > reclaimed. > > +Note: You can move charges of a task along with task migration. See 8. 
> + > 4.3 Removing a cgroup > > A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a > @@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve > NOTE2: It is recommended to set the soft limit always below the hard limit, > otherwise the hard limit will take precedence. > > -8. TODO > +8. Recharge at task move > + > +Users can move charges associated with a task along with task move, that is, > +uncharge from the old cgroup and charge to the new cgroup. > + > +8.1 Interface > + > +This feature is disabled by default. It can be enabled(and disabled again) by > +writing to memory.recharge_at_immigrate of the destination cgroup. > + > +If you want to enable it > + > +# echo 1 > memory.recharget_at_immigrate > + > +Note: A value more than 1 will be supported in futer. See 8.2. > + > +And if you want disable it again > + > +# echo 0 > memory.recharget_at_immigrate > + > +8.2 Type of charges which can be recharged > + > +We recharge a charge which meets the following conditions. > + > +a. It must be charged to the old cgroup. > +b. A charge of an anonymous page used by the target task. The page must be used > + only by the target task. > + > +Note: More type of pages(e.g. file cache, shmem,) will be supported in future. > + > +9. TODO > > 1. Add support for accounting huge pages (as a separate controller) > 2. Make per-cgroup scanner reclaim not-shared pages first > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index afa1179..f4b7116 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -21,6 +21,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > @@ -239,6 +241,18 @@ struct mem_cgroup { > }; > > /* > + * A data structure and a valiable for recharging charges at task move. > + * "recharge" and its members are protected by cgroup_lock > + */ > +struct recharge_struct { > + struct mem_cgroup *from; > + struct mem_cgroup *to; > + struct task_struct *target; /* the target task being moved */ > + unsigned long precharge; > +}; > +static struct recharge_struct recharge; > + > +/* > * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft > * limit reclaim to prevent infinite loops, if they ever occur. > */ > @@ -1496,7 +1510,7 @@ charged: > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. > * if they exceeds softlimit. > */ > - if (mem_cgroup_soft_limit_check(mem)) > + if (page && mem_cgroup_soft_limit_check(mem)) > mem_cgroup_update_tree(mem, page); > done: > return 0; > @@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > } > > /* Handlers for recharge at task move. */ > +/** > + * is_target_pte_for_recharge - check a pte whether it is valid for recharge > + * @vma: the vma the pte to be checked belongs > + * @addr: the address corresponding to the pte to be checked > + * @ptent: the pte to be checked > + * @target: the pointer the target page will be stored(can be NULL) > + * > + * Returns > + * 0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge. > + * 1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target > + * for recharge. if @target is not NULL, the page is stored in target->page > + * with extra refcnt got(Callers should handle it). > + * > + * Called with pte lock held. > + */ > +/* We add a new member later. */ > +union recharge_target { > + struct page *page; > +}; > + > +/* We add a new type later. 
*/ > +enum recharge_target_type { > + RECHARGE_TARGET_NONE, /* not used */ > + RECHARGE_TARGET_PAGE, > +}; > + > +static int is_target_pte_for_recharge(struct vm_area_struct *vma, > + unsigned long addr, pte_t ptent, union recharge_target *target) > +{ > + struct page *page; > + struct page_cgroup *pc; > + int ret = 0; > + > + if (!pte_present(ptent)) > + return 0; > + > + page = vm_normal_page(vma, addr, ptent); > + if (!page || !page_mapped(page)) > + return 0; > + /* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */ > + if (!PageAnon(page)) > + return 0; > + /* > + * TODO: We don't recharge shared(used by multiple processes) pages > + * for now. > + */ > + if (page_mapcount(page) > 1) > + return 0; > + if (!get_page_unless_zero(page)) > + return 0; > + > + pc = lookup_page_cgroup(page); > + lock_page_cgroup(pc); > + if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) { > + ret = RECHARGE_TARGET_PAGE; > + if (target) > + target->page = page; > + } > + unlock_page_cgroup(pc); > + > + if (!ret || !target) > + put_page(page); > + > + return ret; > +} > + > +static int mem_cgroup_recharge_do_precharge(void) > +{ > + int ret = -ENOMEM; > + struct mem_cgroup *mem = recharge.to; > + > + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL); > + if (ret || !mem) > + return -ENOMEM; > + > + recharge.precharge++; > + return ret; > +} > + > +static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd, > + unsigned long addr, unsigned long end, > + struct mm_walk *walk) > +{ > + int ret = 0; > + unsigned long count = 0; > + struct vm_area_struct *vma = walk->private; > + pte_t *pte; > + spinlock_t *ptl; > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + for (; addr != end; pte++, addr += PAGE_SIZE) > + if (is_target_pte_for_recharge(vma, addr, *pte, NULL)) > + count++; > + pte_unmap_unlock(pte - 1, ptl); > + > + while (count-- && !ret) > + ret = mem_cgroup_recharge_do_precharge(); > + > + return ret; > +} > + > +static int mem_cgroup_recharge_prepare(void) > +{ > + int ret = 0; > + struct mm_struct *mm; > + struct vm_area_struct *vma; > + > + mm = get_task_mm(recharge.target); > + if (!mm) > + return 0; > + > + down_read(&mm->mmap_sem); > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + struct mm_walk mem_cgroup_recharge_prepare_walk = { > + .pmd_entry = mem_cgroup_recharge_prepare_pte_range, > + .mm = mm, > + .private = vma, > + }; > + if (signal_pending(current)) { > + ret = -EINTR; > + break; > + } > + if (is_vm_hugetlb_page(vma)) > + continue; > + ret = walk_page_range(vma->vm_start, vma->vm_end, > + &mem_cgroup_recharge_prepare_walk); > + if (ret) > + break; > + cond_resched(); > + } > + up_read(&mm->mmap_sem); > + > + mmput(mm); > + return ret; > +} > + > +static void mem_cgroup_clear_recharge(void) > +{ > + while (recharge.precharge--) > + mem_cgroup_cancel_charge(recharge.to); > + recharge.from = NULL; > + recharge.to = NULL; > + recharge.target = NULL; > +} > + > static int mem_cgroup_can_recharge(struct mem_cgroup *mem, > struct task_struct *p) > { > - return 0; > + int ret; > + struct mem_cgroup *from = mem_cgroup_from_task(p); > + > + if (from == mem) > + return 0; > + > + recharge.from = from; > + recharge.to = mem; > + recharge.target = p; > + recharge.precharge = 0; > + > + ret = mem_cgroup_recharge_prepare(); > + > + if (ret) > + mem_cgroup_clear_recharge(); > + return ret; > } > Hmm...Hmm...looks nicer. But can I have another suggestion ? 
== static int mem_cgroup_can_recharge(struct mem_cgroup *mem, struct task_struct *p) { unsigned long rss; struct mm_struct *mm; mm = get_task_mm(p); if (!mm) return; rss = get_mm_counter(mm, anon_rss); precharge(rss); mmput(mm); } == Do you think anonymous memory are so shared at "move" as that we need page table scan ? If typical sequence is == fork() -> exec() move child == No problem will happen. > static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > @@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, > struct task_struct *p, > bool threadgroup) > { > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup); > + > mutex_unlock(&memcg_tasklist); > + if (mem->recharge_at_immigrate && thread_group_leader(p)) > + mem_cgroup_clear_recharge(); > +} > + > +static int mem_cgroup_recharge_pte_range(pmd_t *pmd, > + unsigned long addr, unsigned long end, > + struct mm_walk *walk) > +{ > + int ret = 0; > + struct vm_area_struct *vma = walk->private; > + pte_t *pte; > + spinlock_t *ptl; > + > +retry: > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + for (; addr != end; addr += PAGE_SIZE) { > + pte_t ptent = *(pte++); > + union recharge_target target; > + int type; > + struct page *page; > + struct page_cgroup *pc; > + > + if (!recharge.precharge) > + break; > + > + type = is_target_pte_for_recharge(vma, addr, ptent, &target); > + switch (type) { > + case RECHARGE_TARGET_PAGE: > + page = target.page; > + if (isolate_lru_page(page)) > + goto put; > + pc = lookup_page_cgroup(page); > + if (!mem_cgroup_move_account(pc, > + recharge.from, recharge.to)) { > + css_put(&recharge.to->css); > + recharge.precharge--; > + } > + putback_lru_page(page); > +put: /* is_target_pte_for_recharge() gets the page */ > + put_page(page); > + break; > + default: > + continue; continue for what ? And we forget "failed to move" pte. This "move" is best-effort service. Right ? > + } > + } > + pte_unmap_unlock(pte - 1, ptl); > + > + if (addr != end) { > + /* > + * We have consumed all precharges we got in can_attach(). > + * We try precharge one by one, but don't do any additional > + * precharges nor recharges to recharge.to if we have failed in > + * precharge once in attach() phase. > + */ > + ret = mem_cgroup_recharge_do_precharge(); > + if (!ret) > + goto retry; > + } > + > + return ret; > } > > static void mem_cgroup_recharge(void) > { > + struct mm_struct *mm; > + struct vm_area_struct *vma; > + > + mm = get_task_mm(recharge.target); > + if (!mm) > + return; > + > + lru_add_drain_all(); > + down_read(&mm->mmap_sem); > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + int ret; > + struct mm_walk mem_cgroup_recharge_walk = { > + .pmd_entry = mem_cgroup_recharge_pte_range, > + .mm = mm, > + .private = vma, > + }; > + if (is_vm_hugetlb_page(vma)) > + continue; > + ret = walk_page_range(vma->vm_start, vma->vm_end, > + &mem_cgroup_recharge_walk); At _this_ point, check VM_SHARED and skip scan is a sane operation. Could you add checks ? > + if (ret) > + /* > + * means we have consumed all precharges and failed in > + * doing additional precharge. Just abandon here. 
> + */ > + break; > + cond_resched(); > + } > + up_read(&mm->mmap_sem); > + > + mmput(mm); > } > > static void mem_cgroup_move_task(struct cgroup_subsys *ss, > @@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, > struct mem_cgroup *mem = mem_cgroup_from_cont(cont); > > mutex_unlock(&memcg_tasklist); > - if (mem->recharge_at_immigrate && thread_group_leader(p)) > + if (mem->recharge_at_immigrate && thread_group_leader(p)) { > mem_cgroup_recharge(); > + mem_cgroup_clear_recharge(); > + } Is it guaranteed that thread_group_leader(p) is true if this is true at can_attach() ? If not, if (.....) { mem_cgroup_recharge() } mem_cgroup_clear_recharge() is better. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 40F7E6B0044 for ; Fri, 6 Nov 2009 01:42:00 -0500 (EST) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA66fvdG022483 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 15:41:58 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 55D1245DE4F for ; Fri, 6 Nov 2009 15:41:57 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 2CE0045DE4E for ; Fri, 6 Nov 2009 15:41:57 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id D9F7B1DB8040 for ; Fri, 6 Nov 2009 15:41:56 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 9317A1DB8038 for ; Fri, 6 Nov 2009 15:41:56 +0900 (JST) Date: Fri, 6 Nov 2009 15:39:23 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Message-Id: <20091106153923.753b0238.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141532.a2fe1187.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141532.a2fe1187.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:15:32 +0900 Daisuke Nishimura wrote: > This recharge-at-task-move feature has extra charges(pre-charges) on "to" > mem_cgroup during recharging. This means an unnecessary oom can happen. > > This patch tries to avoid such oom.
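To make the failure mode concrete before the patch itself (numbers are purely illustrative, not from the thread): suppose "to" has limit=100M and usage=90M, and can_attach() pre-charges 8M for the move. Usage is now 98M, so any other task in "to" that faults in more than 2M hits the limit and, if reclaim cannot make progress, can trigger oom -- even though the 8M is transient and will be either consumed by the move or cancelled when it finishes.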
> > Signed-off-by: Daisuke Nishimura > --- > mm/memcontrol.c | 27 +++++++++++++++++++++++++++ > 1 files changed, 27 insertions(+), 0 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f4b7116..7e96f3b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -248,6 +248,7 @@ struct recharge_struct { > struct mem_cgroup *from; > struct mem_cgroup *to; > struct task_struct *target; /* the target task being moved */ > + struct task_struct *working; /* a task moving the target task */ > unsigned long precharge; > }; > static struct recharge_struct recharge; > @@ -1493,6 +1494,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, > if (mem_cgroup_check_under_limit(mem_over_limit)) > continue; > > + /* try to avoid oom while someone is recharging */ > + if (recharge.working && current != recharge.working) { > + struct mem_cgroup *dest; > + bool do_continue = false; > + /* > + * There is a small race that "dest" can be freed by > + * rmdir, so we use css_tryget(). > + */ > + rcu_read_lock(); > + dest = recharge.to; > + if (dest && css_tryget(&dest->css)) { > + if (dest->use_hierarchy) > + do_continue = css_is_ancestor( > + &dest->css, > + &mem_over_limit->css); > + else > + do_continue = (dest == mem_over_limit); > + css_put(&dest->css); > + } > + rcu_read_unlock(); > + if (do_continue) > + continue; > + } I think it's better to do this here, rather than continue. == if (do_continue) { mutex_lock(&memcg_tasklist_lock); mutex_unlock(&memcg_tasklist_lock); contiue; } == Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 991FD6B0044 for ; Fri, 6 Nov 2009 01:48:22 -0500 (EST) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA66mJoD019458 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 15:48:19 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 994F745DE62 for ; Fri, 6 Nov 2009 15:48:19 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 72BF845DE57 for ; Fri, 6 Nov 2009 15:48:19 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 4F32F1DB8046 for ; Fri, 6 Nov 2009 15:48:19 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 022831DB803B for ; Fri, 6 Nov 2009 15:48:16 +0900 (JST) Date: Fri, 6 Nov 2009 15:45:42 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 0/8] memcg: recharge at task move Message-Id: <20091106154542.5ca9bb61.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 14:10:11 +0900 Daisuke Nishimura wrote: > Hi. > > In current memcg, charges associated with a task aren't moved to the new cgroup > at task move. 
These patches are for this feature, that is, for recharging to > the new cgroup and, of course, uncharging from old cgroup at task move. > > Current version supports only recharge of non-shared(mapcount == 1) anonymous pages > and swaps of those pages. I think it's enough as a first step. > > [1/8] cgroup: introduce cancel_attach() > [2/8] memcg: move memcg_tasklist mutex > [3/8] memcg: add mem_cgroup_cancel_charge() > [4/8] memcg: cleanup mem_cgroup_move_parent() > [5/8] memcg: add interface to recharge at task move > [6/8] memcg: recharge charges of anonymous page > [7/8] memcg: avoid oom during recharge at task move > [8/8] memcg: recharge charges of anonymous swap > > 2 is dependent on 1 and 4 is dependent on 3. > 3 and 4 are just for cleanups. > 5-8 are the body of this feature. > > Major Changes from Oct13: > - removed "[RFC]". > - rebased on mmotm-2009-11-01-10-01. > - dropped support for file cache and shmem/tmpfs(revisit in future). > - Updated Documentation/cgroup/memory.txt. > Seems much nicer but I have some nitpicks as already commented. For [8/8], an mm->swap_usage counter may help make it faster. My concern is how it is shared, but the error will not be very big. > TODO: > - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages. > - implement madvise(2) to let users decide the target vma for recharge. > About this, I think a "force_move_shared_account" flag is enough. But we have to clarify that "mmap()ed but not on page table" entries are not moved.... Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 213FB6B0044 for ; Fri, 6 Nov 2009 02:57:09 -0500 (EST) Date: Fri, 6 Nov 2009 16:49:34 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-Id: <20091106164934.b34d342f.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106145459.351b407f.kamezawa.hiroyu@jp.fujitsu.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> <20091106145459.351b407f.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Fri, 6 Nov 2009 14:54:59 +0900, KAMEZAWA Hiroyuki wrote: > On Fri, 6 Nov 2009 14:11:49 +0900 > Daisuke Nishimura wrote: > > > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and > > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup > > during select_bad_process(). > > > > task_in_mem_cgroup(), which can be called by select_bad_process(), will check > > whether a task is in the mem_cgroup or not by dereferencing task->cgroups > > ->subsys[]. So, it would be desirable to change task->cgroups > > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held. > > > > Now that we can define cancel_attach(), we can safely release memcg_tasklist > > on fail path even if we hold memcg_tasklist in can_attach(). So let's move > > mutex_lock/unlock() of memcg_tasklist.
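For readers following the locking argument, here is a minimal sketch (not the actual kernel code; the real helper is task_in_mem_cgroup()) of the membership test that select_bad_process() depends on:

==
/* sketch: how the oom path resolves a task's memcg (simplified) */
static int task_in_memcg_sketch(struct task_struct *task,
				struct mem_cgroup *mem)
{
	int ret;

	rcu_read_lock();
	/* follows task->cgroups->subsys[], the pointer that
	 * cgroup_attach_task() republishes with rcu_assign_pointer() */
	ret = (mem_cgroup_from_task(task) == mem);
	rcu_read_unlock();
	return ret;
}
==

If that pointer swap is not serialized against the oom scan by memcg_tasklist, a task can appear in (or vanish from) the memcg halfway through select_bad_process().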
> > > > Signed-off-by: Daisuke Nishimura > > --- > > mm/memcontrol.c | 22 ++++++++++++++++++++-- > > 1 files changed, 20 insertions(+), 2 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 4bd3451..d3b2ac0 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > > return ret; > > } > > > > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > > + struct cgroup *cgroup, > > + struct task_struct *p, > > + bool threadgroup) > > +{ > > + mutex_lock(&memcg_tasklist); > > + return 0; > > +} > > Hmm...Is this lock really necessary ? > IOW, can't we just remove memcg_tasklist mutex ? > What kind of bad race happens when we remove this ? > It was introduced at commit 7f4d454d, in which I introduced the mutex instead of using cgroup_mutex to fix a deadlock problem. commit 7f4d454dee2e0bdd21bafd413d1c53e443a26540 Author: Daisuke Nishimura Date: Wed Jan 7 18:08:29 2009 -0800 memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura Reviewed-by: KAMEZAWA Hiroyuki Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Before the commit, mem_cgroup_out_of_memory() holded(and released afterward) cgroup_mutex. Those codes was introduced at commit c7ba5c9e. commit c7ba5c9e8176704bfac0729875fa62798037584d Author: Pavel Emelianov Date: Thu Feb 7 00:13:58 2008 -0800 Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov Signed-off-by: Balbir Singh Cc: Paul Menage Cc: Peter Zijlstra Cc: "Eric W. Biederman" Cc: Nick Piggin Cc: Kirill Korotaev Cc: Herbert Poetzl Cc: David Rientjes Cc: Vaidyanathan Srinivasan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds I'm not sure about the intention of the original cgroup_lock() here, but I imagine that it was for preventing task move during select_bad_process(). If there is no such a lock: Assume cgroup foo has exceeded its limit and is about to triggering oom. 1. Process A, which has been in cgroup baa and uses large memory, is just moved to cgroup foo. Process A can be the candidates for being killed. 2. Process B, which has been in cgroup foo and uses large memory, is just moved from cgroup foo. Process B can be excluded from the candidates for being killed. Hmm, but considering more, those race window exist anyway even if we holds a lock, because try_charge decides wether it should trigger oom or not outside of the lock. 
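The point that the oom decision is made outside the lock can be seen in the shape of the charge path; a rough sketch of the relevant steps (simplified and elided, not the exact mmotm code):

==
/* sketch: inside the __mem_cgroup_try_charge() retry loop */
ret = res_counter_charge(&mem->res, PAGE_SIZE, ...);
if (ret && !try_to_free_mem_cgroup_pages(...))
	oom = true;			/* decided with no lock held */
if (oom) {
	mutex_lock(&memcg_tasklist);	/* tasks may have moved already */
	mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
	mutex_unlock(&memcg_tasklist);
}
==

By the time select_bad_process() runs under the mutex, the task set it scans may already differ from the one that pushed the group over its limit, which is exactly the window described above.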
If this recharge feature is enabled, I think those problems might be avoided by doing like: __mem_cgroup_try_charge() { ... if (oom) { mutex_lock(&memcg_tasklist); if (unlikely(mem_cgroup_check_under_limit)) { mutex_unlock(&memcg_tasklist); continue } mem_cgroup_out_of_memory(); mutex_unlock(&memcg_tasklist); record_last_oom(); } ... } but it makes codes more complex and the recharge feature isn't necessarily enabled. Well, I personally think we can remove these locks completely and make codes simpler. What do you think ? Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id D70356B0044 for ; Fri, 6 Nov 2009 03:05:02 -0500 (EST) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA68503Q025669 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 6 Nov 2009 17:05:00 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id A8A5E45DE6E for ; Fri, 6 Nov 2009 17:04:59 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 7D64A45DE60 for ; Fri, 6 Nov 2009 17:04:59 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 405E81DB803A for ; Fri, 6 Nov 2009 17:04:59 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id C368C1DB803B for ; Fri, 6 Nov 2009 17:04:58 +0900 (JST) Date: Fri, 6 Nov 2009 17:02:25 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-Id: <20091106170225.0e8bd880.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091106164934.b34d342f.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> <20091106145459.351b407f.kamezawa.hiroyu@jp.fujitsu.com> <20091106164934.b34d342f.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Fri, 6 Nov 2009 16:49:34 +0900 Daisuke Nishimura wrote: > If there is no such a lock: > > Assume cgroup foo has exceeded its limit and is about to triggering oom. > 1. Process A, which has been in cgroup baa and uses large memory, > is just moved to cgroup foo. Process A can be the candidates for being killed. > 2. Process B, which has been in cgroup foo and uses large memory, > is just moved from cgroup foo. Process B can be excluded from the candidates for > being killed. > > Hmm, but considering more, those race window exist anyway even if we holds a lock, > because try_charge decides wether it should trigger oom or not outside of the lock. > yes, that's point. > If this recharge feature is enabled, I think those problems might be avoided by doing like: > > __mem_cgroup_try_charge() > { > ... 
> if (oom) { > mutex_lock(&memcg_tasklist); > if (unlikely(mem_cgroup_check_under_limit)) { > mutex_unlock(&memcg_tasklist); > continue > } > mem_cgroup_out_of_memory(); > mutex_unlock(&memcg_tasklist); > record_last_oom(); > } > ... > } > > but it makes codes more complex and the recharge feature isn't necessarily enabled. > > Well, I personally think we can remove these locks completely and make codes simpler. > What do you think ? I myself vote for removing this lock ;) Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id B82516B004D for ; Sun, 8 Nov 2009 19:36:57 -0500 (EST) Date: Mon, 9 Nov 2009 09:31:05 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Message-Id: <20091109093105.ef5596d6.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106153526.19b70518.kamezawa.hiroyu@jp.fujitsu.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141448.6548687a.nishimura@mxp.nes.nec.co.jp> <20091106153526.19b70518.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Fri, 6 Nov 2009 15:35:26 +0900, KAMEZAWA Hiroyuki wrote: > On Fri, 6 Nov 2009 14:14:48 +0900 > Daisuke Nishimura wrote: > > > This patch is the core part of this recharge-at-task-move feature. > > It implements functions to recharge charges of anonymous pages mapped only by > > the target task. > > > > Implementation: > > - define struct recharge_struct and a valuable of it(recharge) to remember > > the count of pre-charges and other information. > > - At can_attach(), parse the page table of the task and count the number of > > mapped pages which are charged to the source mem_cgroup, and call > > __mem_cgroup_try_charge() repeatedly and count up recharge.precharge. > > - At attach(), parse the page table again, find a target page as we did in > > can_attach(), and call mem_cgroup_move_account() about the page. > > - Cancel all charges if recharge.precharge > 0 on failure or at the end of > > task move. > > > > Changelog ? > > will add. > > Signed-off-by: Daisuke Nishimura > > --- > > Documentation/cgroups/memory.txt | 36 +++++- > > mm/memcontrol.c | 275 +++++++++++++++++++++++++++++++++++++- > > 2 files changed, 306 insertions(+), 5 deletions(-) > > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > > index b871f25..54281ff 100644 > > --- a/Documentation/cgroups/memory.txt > > +++ b/Documentation/cgroups/memory.txt > > @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages). > > 4.2 Task migration > > > > When a task migrates from one cgroup to another, it's charge is not > > -carried forward. The pages allocated from the original cgroup still > > +carried forward by default. The pages allocated from the original cgroup still > > remain charged to it, the charge is dropped when the page is freed or > > reclaimed. > > > > +Note: You can move charges of a task along with task migration. See 8. 
> > + > > 4.3 Removing a cgroup > > > > A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a > > @@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve > > NOTE2: It is recommended to set the soft limit always below the hard limit, > > otherwise the hard limit will take precedence. > > > > -8. TODO > > +8. Recharge at task move > > + > > +Users can move charges associated with a task along with task move, that is, > > +uncharge from the old cgroup and charge to the new cgroup. > > + > > +8.1 Interface > > + > > +This feature is disabled by default. It can be enabled(and disabled again) by > > +writing to memory.recharge_at_immigrate of the destination cgroup. > > + > > +If you want to enable it > > + > > +# echo 1 > memory.recharget_at_immigrate > > + > > +Note: A value more than 1 will be supported in futer. See 8.2. > > + > > +And if you want disable it again > > + > > +# echo 0 > memory.recharget_at_immigrate > > + > > +8.2 Type of charges which can be recharged > > + > > +We recharge a charge which meets the following conditions. > > + > > +a. It must be charged to the old cgroup. > > +b. A charge of an anonymous page used by the target task. The page must be used > > + only by the target task. > > + > > +Note: More type of pages(e.g. file cache, shmem,) will be supported in future. > > + > > +9. TODO > > > > 1. Add support for accounting huge pages (as a separate controller) > > 2. Make per-cgroup scanner reclaim not-shared pages first > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index afa1179..f4b7116 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -21,6 +21,8 @@ > > #include > > #include > > #include > > +#include > > +#include > > #include > > #include > > #include > > @@ -239,6 +241,18 @@ struct mem_cgroup { > > }; > > > > /* > > + * A data structure and a valiable for recharging charges at task move. > > + * "recharge" and its members are protected by cgroup_lock > > + */ > > +struct recharge_struct { > > + struct mem_cgroup *from; > > + struct mem_cgroup *to; > > + struct task_struct *target; /* the target task being moved */ > > + unsigned long precharge; > > +}; > > +static struct recharge_struct recharge; > > + > > +/* > > * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft > > * limit reclaim to prevent infinite loops, if they ever occur. > > */ > > @@ -1496,7 +1510,7 @@ charged: > > * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. > > * if they exceeds softlimit. > > */ > > - if (mem_cgroup_soft_limit_check(mem)) > > + if (page && mem_cgroup_soft_limit_check(mem)) > > mem_cgroup_update_tree(mem, page); > > done: > > return 0; > > @@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > > } > > > > /* Handlers for recharge at task move. */ > > +/** > > + * is_target_pte_for_recharge - check a pte whether it is valid for recharge > > + * @vma: the vma the pte to be checked belongs > > + * @addr: the address corresponding to the pte to be checked > > + * @ptent: the pte to be checked > > + * @target: the pointer the target page will be stored(can be NULL) > > + * > > + * Returns > > + * 0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge. > > + * 1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target > > + * for recharge. if @target is not NULL, the page is stored in target->page > > + * with extra refcnt got(Callers should handle it). > > + * > > + * Called with pte lock held. 
> > + */ > > +/* We add a new member later. */ > > +union recharge_target { > > + struct page *page; > > +}; > > + > > +/* We add a new type later. */ > > +enum recharge_target_type { > > + RECHARGE_TARGET_NONE, /* not used */ > > + RECHARGE_TARGET_PAGE, > > +}; > > + > > +static int is_target_pte_for_recharge(struct vm_area_struct *vma, > > + unsigned long addr, pte_t ptent, union recharge_target *target) > > +{ > > + struct page *page; > > + struct page_cgroup *pc; > > + int ret = 0; > > + > > + if (!pte_present(ptent)) > > + return 0; > > + > > + page = vm_normal_page(vma, addr, ptent); > > + if (!page || !page_mapped(page)) > > + return 0; > > + /* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */ > > + if (!PageAnon(page)) > > + return 0; > > + /* > > + * TODO: We don't recharge shared(used by multiple processes) pages > > + * for now. > > + */ > > + if (page_mapcount(page) > 1) > > + return 0; > > + if (!get_page_unless_zero(page)) > > + return 0; > > + > > + pc = lookup_page_cgroup(page); > > + lock_page_cgroup(pc); > > + if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) { > > + ret = RECHARGE_TARGET_PAGE; > > + if (target) > > + target->page = page; > > + } > > + unlock_page_cgroup(pc); > > + > > + if (!ret || !target) > > + put_page(page); > > + > > + return ret; > > +} > > + > > +static int mem_cgroup_recharge_do_precharge(void) > > +{ > > + int ret = -ENOMEM; > > + struct mem_cgroup *mem = recharge.to; > > + > > + ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL); > > + if (ret || !mem) > > + return -ENOMEM; > > + > > + recharge.precharge++; > > + return ret; > > +} > > + > > +static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd, > > + unsigned long addr, unsigned long end, > > + struct mm_walk *walk) > > +{ > > + int ret = 0; > > + unsigned long count = 0; > > + struct vm_area_struct *vma = walk->private; > > + pte_t *pte; > > + spinlock_t *ptl; > > + > > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + for (; addr != end; pte++, addr += PAGE_SIZE) > > + if (is_target_pte_for_recharge(vma, addr, *pte, NULL)) > > + count++; > > + pte_unmap_unlock(pte - 1, ptl); > > + > > + while (count-- && !ret) > > + ret = mem_cgroup_recharge_do_precharge(); > > + > > + return ret; > > +} > > + > > +static int mem_cgroup_recharge_prepare(void) > > +{ > > + int ret = 0; > > + struct mm_struct *mm; > > + struct vm_area_struct *vma; > > + > > + mm = get_task_mm(recharge.target); > > + if (!mm) > > + return 0; > > + > > + down_read(&mm->mmap_sem); > > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > > + struct mm_walk mem_cgroup_recharge_prepare_walk = { > > + .pmd_entry = mem_cgroup_recharge_prepare_pte_range, > > + .mm = mm, > > + .private = vma, > > + }; > > + if (signal_pending(current)) { > > + ret = -EINTR; > > + break; > > + } > > + if (is_vm_hugetlb_page(vma)) > > + continue; > > + ret = walk_page_range(vma->vm_start, vma->vm_end, > > + &mem_cgroup_recharge_prepare_walk); > > + if (ret) > > + break; > > + cond_resched(); > > + } > > + up_read(&mm->mmap_sem); > > + > > + mmput(mm); > > + return ret; > > +} > > + > > +static void mem_cgroup_clear_recharge(void) > > +{ > > + while (recharge.precharge--) > > + mem_cgroup_cancel_charge(recharge.to); > > + recharge.from = NULL; > > + recharge.to = NULL; > > + recharge.target = NULL; > > +} > > + > > static int mem_cgroup_can_recharge(struct mem_cgroup *mem, > > struct task_struct *p) > > { > > - return 0; > > + int ret; > > + struct mem_cgroup *from = mem_cgroup_from_task(p); 
> > + > > + if (from == mem) > > + return 0; > > + > > + recharge.from = from; > > + recharge.to = mem; > > + recharge.target = p; > > + recharge.precharge = 0; > > + > > + ret = mem_cgroup_recharge_prepare(); > > + > > + if (ret) > > + mem_cgroup_clear_recharge(); > > + return ret; > > } > > > > Hmm...Hmm...looks nicer. But can I have another suggestion ? > > == > static int mem_cgroup_can_recharge(struct mem_cgroup *mem, > struct task_struct *p) > { > unsigned long rss; > struct mm_struct *mm; > > mm = get_task_mm(p); > if (!mm) > return; > > rss = get_mm_counter(mm, anon_rss); > precharge(rss); > mmput(mm); > } > == > Do you think anonymous memory are so shared at "move" as that > we need page table scan ? > The reason why I scanned page table twice was to move "swap" charge. There was no counter for swap per process. Yes, the counter is just being added (by you :)). It cannot be used for shmem's swaps, but I won't handle them for now and we can recharge them in attach() phase(by best-effort) anyway. I'll just count the mm_counter in can_attach() phase. > If typical sequence is > == > fork() > -> exec() > > move child > == > No problem will happen. > > > static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > > @@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, > > struct task_struct *p, > > bool threadgroup) > > { > > + struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup); > > + > > mutex_unlock(&memcg_tasklist); > > + if (mem->recharge_at_immigrate && thread_group_leader(p)) > > + mem_cgroup_clear_recharge(); > > +} > > + > > +static int mem_cgroup_recharge_pte_range(pmd_t *pmd, > > + unsigned long addr, unsigned long end, > > + struct mm_walk *walk) > > +{ > > + int ret = 0; > > + struct vm_area_struct *vma = walk->private; > > + pte_t *pte; > > + spinlock_t *ptl; > > + > > +retry: > > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + for (; addr != end; addr += PAGE_SIZE) { > > + pte_t ptent = *(pte++); > > + union recharge_target target; > > + int type; > > + struct page *page; > > + struct page_cgroup *pc; > > + > > + if (!recharge.precharge) > > + break; > > + > > + type = is_target_pte_for_recharge(vma, addr, ptent, &target); > > + switch (type) { > > + case RECHARGE_TARGET_PAGE: > > + page = target.page; > > + if (isolate_lru_page(page)) > > + goto put; > > + pc = lookup_page_cgroup(page); > > + if (!mem_cgroup_move_account(pc, > > + recharge.from, recharge.to)) { > > + css_put(&recharge.to->css); > > + recharge.precharge--; > > + } > > + putback_lru_page(page); > > +put: /* is_target_pte_for_recharge() gets the page */ > > + put_page(page); > > + break; > > + default: > > + continue; > > continue for what ? > Nothing :) Just move to the next step in this for-loop. I think it's a leftover of my old code... > And we forget "failed to move" pte. This "move" is best-effort service. > Right ? > "if (!recharge.precharge)" above and "if (addr != end)" bellow will do that. > > + } > > + } > > + pte_unmap_unlock(pte - 1, ptl); > > + > > + if (addr != end) { > > + /* > > + * We have consumed all precharges we got in can_attach(). > > + * We try precharge one by one, but don't do any additional > > + * precharges nor recharges to recharge.to if we have failed in > > + * precharge once in attach() phase. 
> > + */ > > + ret = mem_cgroup_recharge_do_precharge(); > > + if (!ret) > > + goto retry; > > + } > > + > > + return ret; > > } > > > > > > > static void mem_cgroup_recharge(void) > > { > > + struct mm_struct *mm; > > + struct vm_area_struct *vma; > > + > > + mm = get_task_mm(recharge.target); > > + if (!mm) > > + return; > > + > > + lru_add_drain_all(); > > + down_read(&mm->mmap_sem); > > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > > + int ret; > > + struct mm_walk mem_cgroup_recharge_walk = { > > + .pmd_entry = mem_cgroup_recharge_pte_range, > > + .mm = mm, > > + .private = vma, > > + }; > > + if (is_vm_hugetlb_page(vma)) > > + continue; > > + ret = walk_page_range(vma->vm_start, vma->vm_end, > > + &mem_cgroup_recharge_walk); > > At _this_ point, check VM_SHARED and skip scan is a sane operation. > Could you add checks ? > will do. > > > + if (ret) > > + /* > > + * means we have consumed all precharges and failed in > > + * doing additional precharge. Just abandon here. > > + */ > > + break; > > + cond_resched(); > > + } > > + up_read(&mm->mmap_sem); > > + > > + mmput(mm); > > } > > > > static void mem_cgroup_move_task(struct cgroup_subsys *ss, > > @@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, > > struct mem_cgroup *mem = mem_cgroup_from_cont(cont); > > > > mutex_unlock(&memcg_tasklist); > > - if (mem->recharge_at_immigrate && thread_group_leader(p)) > > + if (mem->recharge_at_immigrate && thread_group_leader(p)) { > > mem_cgroup_recharge(); > > + mem_cgroup_clear_recharge(); > > + } > > Is it guranteed that thread_group_leader(p) is true if this is true at > can_attach() ? > If no, > if (.....) { > mem_cgroup_recharge() > } > mem_cgroup_cleare_recharge() > > is better. > will change. Thank you for your review. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id B8C5C6B004D for ; Sun, 8 Nov 2009 20:52:10 -0500 (EST) Date: Mon, 9 Nov 2009 10:44:46 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 0/8] memcg: recharge at task move Message-Id: <20091109104446.b2d9ef66.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106154542.5ca9bb61.kamezawa.hiroyu@jp.fujitsu.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106154542.5ca9bb61.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Fri, 6 Nov 2009 15:45:42 +0900, KAMEZAWA Hiroyuki wrote: > On Fri, 6 Nov 2009 14:10:11 +0900 > Daisuke Nishimura wrote: > > > Hi. > > > > In current memcg, charges associated with a task aren't moved to the new cgroup > > at task move. These patches are for this feature, that is, for recharging to > > the new cgroup and, of course, uncharging from old cgroup at task move. > > > > Current virsion supports only recharge of non-shared(mapcount == 1) anonymous pages > > and swaps of those pages. I think it's enough as a first step. 
> > > > [1/8] cgroup: introduce cancel_attach() > > [2/8] memcg: move memcg_tasklist mutex > > [3/8] memcg: add mem_cgroup_cancel_charge() > > [4/8] memcg: cleanup mem_cgroup_move_parent() > > [5/8] memcg: add interface to recharge at task move > > [6/8] memcg: recharge charges of anonymous page > > [7/8] memcg: avoid oom during recharge at task move > > [8/8] memcg: recharge charges of anonymous swap > > > > 2 is dependent on 1 and 4 is dependent on 3. > > 3 and 4 are just for cleanups. > > 5-8 are the body of this feature. > > > > Major Changes from Oct13: > > - removed "[RFC]". > > - rebased on mmotm-2009-11-01-10-01. > > - dropped support for file cache and shmem/tmpfs(revisit in future). > > - Updated Documentation/cgroup/memory.txt. > > > > Seems much nicer but I have some nitpicks as already commented. > > > > For [8/8], an mm->swap_usage counter may help make it faster. > > My concern is how it is shared, but the error will not be very big. > > > will change as I mentioned in another mail. > > I'll repost 3 and 4 as cleanup(I think they are ready for inclusion), > and post removal-of-memcg_tasklist as a separate patch. > > I'll postpone the body of this feature(waiting for your percpu change > and per-process swap counter at least). > > > TODO: > > > - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages. > > > - implement madvise(2) to let users decide the target vma for recharge. > > > > > > About this, I think a "force_move_shared_account" flag is enough. > > But we have to clarify that "mmap()ed but not on page table" entries are not > > moved.... > > > You mean swap entries of shmem/tmpfs, don't you ? I agree they are hard to handle... > > My concern is: > > - I want to add support for private file caches, shmem/tmpfs pages(including swaps of them), > and "shared" pages by some means in future, and let an admin or a middleware > decide how to handle them. > - Once this feature has been merged(at .33?), I don't want to change the behavior > when a user sets "recharge_at_immigrate=1". > So, I'll "extend" the meaning of "recharge_at_immigrate" or add a new flag file > to support other types of charges. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 1CD016B004D for ; Mon, 9 Nov 2009 00:08:13 -0500 (EST) Received: from d23relay04.au.ibm.com (d23relay04.au.ibm.com [202.81.31.246]) by e23smtp06.au.ibm.com (8.14.3/8.13.1) with ESMTP id nA957unV009442 for ; Mon, 9 Nov 2009 16:07:56 +1100 Received: from d23av04.au.ibm.com (d23av04.au.ibm.com [9.190.235.139]) by d23relay04.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nA9553du790638 for ; Mon, 9 Nov 2009 16:05:03 +1100 Received: from d23av04.au.ibm.com (loopback [127.0.0.1]) by d23av04.au.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nA9588Bd013334 for ; Mon, 9 Nov 2009 16:08:08 +1100 Date: Mon, 9 Nov 2009 10:38:06 +0530 From: Balbir Singh Subject: Re: [PATCH -mmotm 0/8] memcg: recharge at task move Message-ID: <20091109050806.GB3042@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage List-ID: * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:10:11]: > Hi. > > In current memcg, charges associated with a task aren't moved to the new cgroup > at task move. These patches are for this feature, that is, for recharging to > the new cgroup and, of course, uncharging from old cgroup at task move. > > Current version supports only recharge of non-shared(mapcount == 1) anonymous pages > and swaps of those pages. I think it's enough as a first step. > What is the motivation? Is it to provide better accountability? I think it is worthwhile to look into it, provided the cost of migration is not too high. It was on my list of things to look at, but I found that if cpu/memory are mounted together and the cost of migration is high, it can be a bottleneck in some cases. I'll review the patches a bit more. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id E5FCC6B004D for ; Mon, 9 Nov 2009 00:19:11 -0500 (EST) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nA95J83O022858 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Mon, 9 Nov 2009 14:19:09 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 95C9545DE54 for ; Mon, 9 Nov 2009 14:19:08 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 5D48E45DE4D for ; Mon, 9 Nov 2009 14:19:08 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 348B11DB804D for ; Mon, 9 Nov 2009 14:19:08 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id CEDA81DB8041 for ; Mon, 9 Nov 2009 14:19:07 +0900 (JST) Date: Mon, 9 Nov 2009 14:16:09 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 0/8] memcg: recharge at task move Message-Id: <20091109141609.331eee77.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091109104446.b2d9ef66.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106154542.5ca9bb61.kamezawa.hiroyu@jp.fujitsu.com> <20091109104446.b2d9ef66.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , Li Zefan , Paul Menage List-ID: On Mon, 9 Nov 2009 10:44:46 +0900 Daisuke Nishimura wrote: > > Seems much nicer but I have some nitpicks as already commented. > > > > For [8/8], mm->swap_usage counter may be a help for making it faster. > > The concern is how it's shared, but it will not be a very big error. > > > will change as I mentioned in another mail. > > I'll repost 3 and 4 as cleanup(I think they are ready for inclusion), > and post removal-of-memcg_tasklist as a separate patch. > > I'll postpone the body of this feature(waiting for your percpu change > and per-process swap counter at least). > Thanks. I agree 1-4 are ready for merge. But please get an ack from Paul or Li Zefan. Paul, Li, what do you think about [1/8]? > > > TODO: > > > - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages. > > > - implement madvise(2) to let users decide the target vma for recharge. > > > > > > > About this, I think a "force_move_shared_account" flag is enough. > > But we have to clarify "mmap()ed but not on page table" entries are not > > moved.... > > > You mean swap entries of shmem/tmpfs, don't you? I agree they are hard to handle... > Yes and no. Including mmap()ed files. > > My concern is: > > - I want to add support for private file caches, shmem/tmpfs pages(including swaps of them), > and "shared" pages by some means in future, and let an admin or a middleware > decide how to handle them. Yes, I agree that middleware may want that. > - Once this feature has been merged(at .33?), I don't want to change the behavior > when a user sets "recharge_at_immigrate=1". > So, I'll "extend" the meaning of "recharge_at_immigrate" or add a new flag file > to support other types of charges. Yes, I think making recharge_at_immigrate a "bitmask" makes sense.
So, showing that "recharge_at_immigrate" is a bitmask _now_, both in code and documentation, in a clear way will be a help for the future. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 689E06B004D for ; Mon, 9 Nov 2009 01:58:07 -0500 (EST) Received: from d28relay01.in.ibm.com (d28relay01.in.ibm.com [9.184.220.58]) by e28smtp07.in.ibm.com (8.14.3/8.13.1) with ESMTP id nA96w0h3002900 for ; Mon, 9 Nov 2009 12:28:00 +0530 Received: from d28av01.in.ibm.com (d28av01.in.ibm.com [9.184.220.63]) by d28relay01.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nA96w0Cq2797820 for ; Mon, 9 Nov 2009 12:28:00 +0530 Received: from d28av01.in.ibm.com (loopback [127.0.0.1]) by d28av01.in.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nA96w0wr002963 for ; Mon, 9 Nov 2009 12:28:00 +0530 Date: Mon, 9 Nov 2009 12:27:59 +0530 From: Balbir Singh Subject: Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Message-ID: <20091109065759.GC3042@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage List-ID: * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:11:06]: > This patch adds cancel_attach() operation to struct cgroup_subsys. > cancel_attach() can be used when can_attach() operation prepares something > for the subsys, but we should rollback what can_attach() operation has prepared > if attach task fails after we've succeeded in can_attach(). > > Signed-off-by: Daisuke Nishimura > --- > Documentation/cgroups/cgroups.txt | 13 +++++++++++- > include/linux/cgroup.h | 2 + > kernel/cgroup.c | 38 ++++++++++++++++++++++++++++++------ > 3 files changed, 45 insertions(+), 8 deletions(-) > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > index 0b33bfe..c86947c 100644 > --- a/Documentation/cgroups/cgroups.txt > +++ b/Documentation/cgroups/cgroups.txt > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation. If a NULL > task is passed, then a successful result indicates that *any* > unspecified task can be moved into the cgroup. Note that this isn't > called on a fork. If this method returns 0 (success) then this should > -remain valid while the caller holds cgroup_mutex. If threadgroup is > +remain valid while the caller holds cgroup_mutex and it is ensured that either > +attach() or cancel_attach() will be called in futer. If threadgroup is > true, then a successful result indicates that all threads in the given > thread's threadgroup can be moved together. > > +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > + struct task_struct *task, bool threadgroup) > +(cgroup_mutex held by caller) > + > +Called when a task attach operation has failed after can_attach() has succeeded. > +A subsystem whose can_attach() has some side-effects should provide this > +function, so that the subsytem can implement a rollback.
If not, not necessary. > +This will be called only about subsystems whose can_attach() operation have > +succeeded. > + > void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > struct cgroup *old_cgrp, struct task_struct *task, > bool threadgroup) > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > index 0008dee..d4cc200 100644 > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -427,6 +427,8 @@ struct cgroup_subsys { > void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); > int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > struct task_struct *tsk, bool threadgroup); > + void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > + struct task_struct *tsk, bool threadgroup); > void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > struct cgroup *old_cgrp, struct task_struct *tsk, > bool threadgroup); > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 0249f4b..e443742 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen) > int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > { > int retval = 0; > - struct cgroup_subsys *ss; > + struct cgroup_subsys *ss, *failed_ss = NULL; > struct cgroup *oldcgrp; > struct css_set *cg; > struct css_set *newcg; > @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > for_each_subsys(root, ss) { > if (ss->can_attach) { > retval = ss->can_attach(ss, cgrp, tsk, false); > - if (retval) > - return retval; > + if (retval) { > + /* > + * Remember at which subsystem we've failed in > + * can_attach() to call cancel_attach() only > + * against subsystems whose attach() have > + * succeeded(see below). > + */ > + failed_ss = ss; failed_ss is global? Is it a marker into an array of subsystems? Don't we need more than one failed_ss for each failed subsystem? Or do we find the first failed subsystem, cancel_attach and fail all migrations? > + goto out; > + } > } > } > > @@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > */ > newcg = find_css_set(cg, cgrp); > put_css_set(cg); > - if (!newcg) > - return -ENOMEM; > + if (!newcg) { > + retval = -ENOMEM; > + goto out; > + } > > task_lock(tsk); > if (tsk->flags & PF_EXITING) { > task_unlock(tsk); > put_css_set(newcg); > - return -ESRCH; > + retval = -ESRCH; > + goto out; > } > rcu_assign_pointer(tsk->cgroups, newcg); > task_unlock(tsk); > @@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > * is no longer empty. > */ > cgroup_wakeup_rmdir_waiter(cgrp); > - return 0; > +out: > + if (retval) > + for_each_subsys(root, ss) { > + if (ss == failed_ss) > + /* > + * This means can_attach() of this subsystem > + * have failed, so we don't need to call > + * cancel_attach() against rests of subsystems. > + */ > + break; > + if (ss->cancel_attach) > + ss->cancel_attach(ss, cgrp, tsk, false); > + } > + return retval; > } > -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id BF5186B004D for ; Mon, 9 Nov 2009 02:19:16 -0500 (EST) Message-ID: <4AF7C238.2080203@cn.fujitsu.com> Date: Mon, 09 Nov 2009 15:18:16 +0800 From: Li Zefan MIME-Version: 1.0 Subject: Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> <20091109065759.GC3042@balbir.in.ibm.com> In-Reply-To: <20091109065759.GC3042@balbir.in.ibm.com> Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: balbir@linux.vnet.ibm.com Cc: Daisuke Nishimura , linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Paul Menage List-ID: >> @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) >> for_each_subsys(root, ss) { >> if (ss->can_attach) { >> retval = ss->can_attach(ss, cgrp, tsk, false); >> - if (retval) >> - return retval; >> + if (retval) { >> + /* >> + * Remember at which subsystem we've failed in >> + * can_attach() to call cancel_attach() only >> + * against subsystems whose attach() have >> + * succeeded(see below). >> + */ >> + failed_ss = ss; > > failed_ss is global? Is it a marker into an array of subsystems? Don't > we need more than one failed_ss for each failed subsystem? Or do we > find the first failed subsystem, cancel_attach and fail all > migrations? > The latter. We record the first subsys that failed can_attach(), and break out of the for loop, and call cancel_attach() on those subsystems that have succeeded in can_attach(). >> + goto out; >> + } >> } >> } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id AE1386B0062 for ; Mon, 9 Nov 2009 02:24:01 -0500 (EST) Message-ID: <4AF7C356.5020504@cn.fujitsu.com> Date: Mon, 09 Nov 2009 15:23:02 +0800 From: Li Zefan MIME-Version: 1.0 Subject: Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Paul Menage List-ID: Daisuke Nishimura wrote: > This patch adds cancel_attach() operation to struct cgroup_subsys. > cancel_attach() can be used when can_attach() operation prepares something > for the subsys, but we should rollback what can_attach() operation has prepared > if attach task fails after we've succeeded in can_attach().
> > Signed-off-by: Daisuke Nishimura Acked-by: Li Zefan > --- > Documentation/cgroups/cgroups.txt | 13 +++++++++++- > include/linux/cgroup.h | 2 + > kernel/cgroup.c | 38 ++++++++++++++++++++++++++++++------ > 3 files changed, 45 insertions(+), 8 deletions(-) > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > index 0b33bfe..c86947c 100644 > --- a/Documentation/cgroups/cgroups.txt > +++ b/Documentation/cgroups/cgroups.txt > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation. If a NULL > task is passed, then a successful result indicates that *any* > unspecified task can be moved into the cgroup. Note that this isn't > called on a fork. If this method returns 0 (success) then this should > -remain valid while the caller holds cgroup_mutex. If threadgroup is > +remain valid while the caller holds cgroup_mutex and it is ensured that either > +attach() or cancel_attach() will be called in futer. If threadgroup is s/futer/future > true, then a successful result indicates that all threads in the given > thread's threadgroup can be moved together. ... > +out: > + if (retval) I prefer: if (retval) { ... } > + for_each_subsys(root, ss) { > + if (ss == failed_ss) > + /* > + * This means can_attach() of this subsystem > + * have failed, so we don't need to call > + * cancel_attach() against rests of subsystems. > + */ > + break; > + if (ss->cancel_attach) > + ss->cancel_attach(ss, cgrp, tsk, false); > + } > + return retval; > } > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 1B6936B004D for ; Mon, 9 Nov 2009 02:30:15 -0500 (EST) Date: Mon, 9 Nov 2009 16:23:30 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Message-Id: <20091109162330.e6a268b7.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091109065759.GC3042@balbir.in.ibm.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> <20091109065759.GC3042@balbir.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: balbir@linux.vnet.ibm.com Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Mon, 9 Nov 2009 12:27:59 +0530, Balbir Singh wrote: > * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:11:06]: > > > This patch adds cancel_attach() operation to struct cgroup_subsys. > > cancel_attach() can be used when can_attach() operation prepares something > > for the subsys, but we should rollback what can_attach() operation has prepared > > if attach task fails after we've succeeded in can_attach(). > > > > Signed-off-by: Daisuke Nishimura > > --- > > Documentation/cgroups/cgroups.txt | 13 +++++++++++- > > include/linux/cgroup.h | 2 + > > kernel/cgroup.c | 38 ++++++++++++++++++++++++++++++------ > > 3 files changed, 45 insertions(+), 8 deletions(-) > > > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > > index 0b33bfe..c86947c 100644 > > --- a/Documentation/cgroups/cgroups.txt > > +++ b/Documentation/cgroups/cgroups.txt > > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.
If a NULL > > task is passed, then a successful result indicates that *any* > > unspecified task can be moved into the cgroup. Note that this isn't > > called on a fork. If this method returns 0 (success) then this should > > -remain valid while the caller holds cgroup_mutex. If threadgroup is > > +remain valid while the caller holds cgroup_mutex and it is ensured that either > > +attach() or cancel_attach() will be called in futer. If threadgroup is > > true, then a successful result indicates that all threads in the given > > thread's threadgroup can be moved together. > > > > +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > > + struct task_struct *task, bool threadgroup) > > +(cgroup_mutex held by caller) > > + > > +Called when a task attach operation has failed after can_attach() has succeeded. > > +A subsystem whose can_attach() has some side-effects should provide this > > +function, so that the subsytem can implement a rollback. If not, not necessary. > > +This will be called only about subsystems whose can_attach() operation have > > +succeeded. > > + > > void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > > struct cgroup *old_cgrp, struct task_struct *task, > > bool threadgroup) > > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h > > index 0008dee..d4cc200 100644 > > --- a/include/linux/cgroup.h > > +++ b/include/linux/cgroup.h > > @@ -427,6 +427,8 @@ struct cgroup_subsys { > > void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); > > int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > > struct task_struct *tsk, bool threadgroup); > > + void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > > + struct task_struct *tsk, bool threadgroup); > > void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, > > struct cgroup *old_cgrp, struct task_struct *tsk, > > bool threadgroup); > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > > index 0249f4b..e443742 100644 > > --- a/kernel/cgroup.c > > +++ b/kernel/cgroup.c > > @@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen) > > int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > > { > > int retval = 0; > > - struct cgroup_subsys *ss; > > + struct cgroup_subsys *ss, *failed_ss = NULL; > > struct cgroup *oldcgrp; > > struct css_set *cg; > > struct css_set *newcg; > > @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > > for_each_subsys(root, ss) { > > if (ss->can_attach) { > > retval = ss->can_attach(ss, cgrp, tsk, false); > > - if (retval) > > - return retval; > > + if (retval) { > > + /* > > + * Remember at which subsystem we've failed in > > + * can_attach() to call cancel_attach() only > > + * against subsystems whose attach() have > > + * succeeded(see below). > > + */ > > + failed_ss = ss; > > failed_ss is global? Is it a marker into an array of subsystems? Don't > we need more than one failed_ss for each failed subsystem? Or do we > find the first failed subsystem, cancel_attach and fail all > migrations? > failed_ss is just a local variable (see above definition) :) It is used to remember "at which subsystem of for_each_subsys we failed in can_attach()". By remembering it, we can avoid calling cancel_attach() against subsystems whose can_attach() has failed or has not even been called. Thanks, Daisuke Nishimura.
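The pattern can also be shown in isolation. A standalone sketch with made-up subsystem callbacks (not the cgroup code itself): remember the first subsystem whose can_attach() failed, then roll back only its predecessors.

#include <stdio.h>
#include <stddef.h>

struct subsys {
	const char *name;
	int (*can_attach)(void);
	void (*cancel_attach)(void);
};

static int ok(void) { return 0; }
static int fail(void) { return -1; }
static void undo(void) { printf("rolled back\n"); }

static int attach_all(struct subsys *ss, size_t n)
{
	size_t i, failed;
	int ret = 0;

	for (i = 0; i < n; i++) {
		if (ss[i].can_attach && (ret = ss[i].can_attach()))
			break; /* remember where we stopped */
	}
	if (ret) {
		failed = i;
		/* cancel only the subsystems before the failed one */
		for (i = 0; i < failed; i++)
			if (ss[i].cancel_attach)
				ss[i].cancel_attach();
	}
	return ret;
}

int main(void)
{
	struct subsys ss[] = {
		{ "first",  ok,   undo }, /* succeeded: gets rolled back */
		{ "second", fail, NULL }, /* failed: no cancel_attach() for it */
		{ "third",  ok,   undo }, /* never reached: left untouched */
	};

	if (attach_all(ss, 3))
		printf("attach failed; predecessors rolled back\n");
	return 0;
}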
> > + goto out; > > + } > > } > > } > > > > @@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > > */ > > newcg = find_css_set(cg, cgrp); > > put_css_set(cg); > > - if (!newcg) > > - return -ENOMEM; > > + if (!newcg) { > > + retval = -ENOMEM; > > + goto out; > > + } > > > > task_lock(tsk); > > if (tsk->flags & PF_EXITING) { > > task_unlock(tsk); > > put_css_set(newcg); > > - return -ESRCH; > > + retval = -ESRCH; > > + goto out; > > } > > rcu_assign_pointer(tsk->cgroups, newcg); > > task_unlock(tsk); > > @@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) > > * is no longer empty. > > */ > > cgroup_wakeup_rmdir_waiter(cgrp); > > - return 0; > > +out: > > + if (retval) > > + for_each_subsys(root, ss) { > > + if (ss == failed_ss) > > + /* > > + * This means can_attach() of this subsystem > > + * have failed, so we don't need to call > > + * cancel_attach() against rests of subsystems. > > + */ > > + break; > > + if (ss->cancel_attach) > > + ss->cancel_attach(ss, cgrp, tsk, false); > > + } > > + return retval; > > } > > > > -- > Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id ECEE06B004D for ; Mon, 9 Nov 2009 02:43:24 -0500 (EST) Date: Mon, 9 Nov 2009 16:38:36 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Message-Id: <20091109163836.c656819f.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <4AF7C356.5020504@cn.fujitsu.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141106.a2bd995a.nishimura@mxp.nes.nec.co.jp> <4AF7C356.5020504@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Li Zefan Cc: linux-mm , Andrew Morton , Balbir Singh , KAMEZAWA Hiroyuki , Paul Menage , Daisuke Nishimura List-ID: On Mon, 09 Nov 2009 15:23:02 +0800, Li Zefan wrote: > Daisuke Nishimura wrote: > > This patch adds cancel_attach() operation to struct cgroup_subsys. > > cancel_attach() can be used when can_attach() operation prepares something > > for the subsys, but we should rollback what can_attach() operation has prepared > > if attach task fails after we've succeeded in can_attach(). > > > > Signed-off-by: Daisuke Nishimura > > Acked-by: Li Zefan > Thank you for your ack. > > --- > > Documentation/cgroups/cgroups.txt | 13 +++++++++++- > > include/linux/cgroup.h | 2 + > > kernel/cgroup.c | 38 ++++++++++++++++++++++++++++++------ > > 3 files changed, 45 insertions(+), 8 deletions(-) > > > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > > index 0b33bfe..c86947c 100644 > > --- a/Documentation/cgroups/cgroups.txt > > +++ b/Documentation/cgroups/cgroups.txt > > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation. If a NULL > > task is passed, then a successful result indicates that *any* > > unspecified task can be moved into the cgroup. Note that this isn't > > called on a fork. If this method returns 0 (success) then this should > > -remain valid while the caller holds cgroup_mutex. 
If threadgroup is > > +remain valid while the caller holds cgroup_mutex and it is ensured that either > > +attach() or cancel_attach() will be called in futer. If threadgroup is > > s/futer/future > Ah, I'm sorry. will fix. > > true, then a successful result indicates that all threads in the given > > thread's threadgroup can be moved together. > ... > > +out: > > + if (retval) > > I prefer: > > if (retval) { > ... > } > Okay. will change in next post. Thanks, Daisuke Nishimura. > > + for_each_subsys(root, ss) { > > + if (ss == failed_ss) > > + /* > > + * This means can_attach() of this subsystem > > + * have failed, so we don't need to call > > + * cancel_attach() against rests of subsystems. > > + */ > > + break; > > + if (ss->cancel_attach) > > + ss->cancel_attach(ss, cgrp, tsk, false); > > + } > > + return retval; > > } > > > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 716546B004D for ; Mon, 9 Nov 2009 03:36:35 -0500 (EST) Date: Mon, 9 Nov 2009 17:24:44 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 0/8] memcg: recharge at task move Message-Id: <20091109172444.00ceae65.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091109050806.GB3042@balbir.in.ibm.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091109050806.GB3042@balbir.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: balbir@linux.vnet.ibm.com Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Mon, 9 Nov 2009 10:38:06 +0530, Balbir Singh wrote: > * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:10:11]: > > > Hi. > > > > In current memcg, charges associated with a task aren't moved to the new cgroup > > at task move. These patches are for this feature, that is, for recharging to > > the new cgroup and, of course, uncharging from old cgroup at task move. > > > > Current version supports only recharge of non-shared(mapcount == 1) anonymous pages > > and swaps of those pages. I think it's enough as a first step. > > > > What is the motivation? Is it to provide better accountability? I think > it is worthwhile to look into it, provided the cost of migration is > not too high. It was on my list of things to look at, but I found that > if cpu/memory are mounted together and the cost of migration is high, > it can be a bottleneck in some cases. I'll review the patches a bit > more. > My purpose of using memcg is to restrict the usage of memory/swap by a group of processes. IOW, prepare a box with a configurable size and put processes into it. The point is, there are users who think charges are associated with processes, not with the box. Current behavior is very unnatural for those users. If a user comes to want to manage some processes under another new box because, for example, they use more memory than expected and he wants to isolate their influence from the original group, the current behavior is very bad. Anyway, we don't move any charge by default and an admin or a middleware can decide the behavior. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 995C56B004D for ; Tue, 10 Nov 2009 14:14:46 -0500 (EST) Received: from d28relay03.in.ibm.com (d28relay03.in.ibm.com [9.184.220.60]) by e28smtp09.in.ibm.com (8.14.3/8.13.1) with ESMTP id nAAIw9Gw030747 for ; Wed, 11 Nov 2009 00:28:09 +0530 Received: from d28av03.in.ibm.com (d28av03.in.ibm.com [9.184.220.65]) by d28relay03.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nAAJEPdF3784716 for ; Wed, 11 Nov 2009 00:44:25 +0530 Received: from d28av03.in.ibm.com (loopback [127.0.0.1]) by d28av03.in.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nAAJEOua014947 for ; Wed, 11 Nov 2009 06:14:25 +1100 Date: Wed, 11 Nov 2009 00:44:23 +0530 From: Balbir Singh Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-ID: <20091110191423.GD3314@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage List-ID: * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:11:49]: > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup > during select_bad_process(). > > task_in_mem_cgroup(), which can be called by select_bad_process(), will check > whether a task is in the mem_cgroup or not by dereferencing task->cgroups > ->subsys[]. So, it would be desirable to change task->cgroups > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held. > > Now that we can define cancel_attach(), we can safely release memcg_tasklist > on fail path even if we hold memcg_tasklist in can_attach(). So let's move > mutex_lock/unlock() of memcg_tasklist. > > Signed-off-by: Daisuke Nishimura > --- > mm/memcontrol.c | 22 ++++++++++++++++++++-- > 1 files changed, 20 insertions(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 4bd3451..d3b2ac0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > return ret; > } > > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > + struct cgroup *cgroup, > + struct task_struct *p, > + bool threadgroup) > +{ > + mutex_lock(&memcg_tasklist); > + return 0; > +} > + > +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, > + struct cgroup *cgroup, > + struct task_struct *p, > + bool threadgroup) > +{ > + mutex_unlock(&memcg_tasklist); > +} > + > static void mem_cgroup_move_task(struct cgroup_subsys *ss, > struct cgroup *cont, > struct cgroup *old_cont, > struct task_struct *p, > bool threadgroup) > { > - mutex_lock(&memcg_tasklist); > + mutex_unlock(&memcg_tasklist); What does this mean for nesting? I think the APIs are called with cgroup_mutex held, so memcg_tasklist nests under cgroup_mutex, right? Could you please document that at the mutex declaration point. Shouldn't you be removing the FIXME as well?
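The asymmetric lock lifetime being asked about can be modeled on its own. A standalone sketch, with a pthread mutex standing in for memcg_tasklist (this is not the kernel code): the lock taken in can_attach() is released on exactly one of the two later callbacks, attach() on success or cancel_attach() on failure.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tasklist = PTHREAD_MUTEX_INITIALIZER;

static int can_attach(void)     { pthread_mutex_lock(&tasklist); return 0; }
static void cancel_attach(void) { pthread_mutex_unlock(&tasklist); } /* failure path */
static void attach(void)        { pthread_mutex_unlock(&tasklist); } /* success path */

int main(void)
{
	int attach_failed = 0; /* flip to 1 to exercise the rollback path */

	if (can_attach() == 0) {
		if (attach_failed)
			cancel_attach();
		else
			attach();
	}
	printf("lock released on exactly one path\n");
	return 0;
}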
> /* > * FIXME: It's better to move charges of this process from old > * memcg to new memcg. But it's just on TODO-List now. > */ > - mutex_unlock(&memcg_tasklist); > } > > struct cgroup_subsys mem_cgroup_subsys = { > @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = { > .pre_destroy = mem_cgroup_pre_destroy, > .destroy = mem_cgroup_destroy, > .populate = mem_cgroup_populate, > + .can_attach = mem_cgroup_can_attach, > + .cancel_attach = mem_cgroup_cancel_attach, > .attach = mem_cgroup_move_task, > .early_init = 0, > .use_id = 1, > -- > 1.5.6.1 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id A26FE6B004D for ; Tue, 10 Nov 2009 18:51:16 -0500 (EST) Date: Wed, 11 Nov 2009 08:44:35 +0900 From: Daisuke Nishimura Subject: Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Message-Id: <20091111084435.4686ba4f.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091110191423.GD3314@balbir.in.ibm.com> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091106141149.9c7e94d5.nishimura@mxp.nes.nec.co.jp> <20091110191423.GD3314@balbir.in.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: balbir@linux.vnet.ibm.com Cc: linux-mm , Andrew Morton , KAMEZAWA Hiroyuki , Li Zefan , Paul Menage , Daisuke Nishimura List-ID: On Wed, 11 Nov 2009 00:44:23 +0530, Balbir Singh wrote: > * nishimura@mxp.nes.nec.co.jp [2009-11-06 14:11:49]: > > > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and > > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup > > during select_bad_process(). > > > > task_in_mem_cgroup(), which can be called by select_bad_process(), will check > > whether a task is in the mem_cgroup or not by dereferencing task->cgroups > > ->subsys[]. So, it would be desirable to change task->cgroups > > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held. > > > > Now that we can define cancel_attach(), we can safely release memcg_tasklist > > on fail path even if we hold memcg_tasklist in can_attach(). So let's move > > mutex_lock/unlock() of memcg_tasklist. 
> > > > Signed-off-by: Daisuke Nishimura > > --- > > mm/memcontrol.c | 22 ++++++++++++++++++++-- > > 1 files changed, 20 insertions(+), 2 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 4bd3451..d3b2ac0 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, > > return ret; > > } > > > > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss, > > + struct cgroup *cgroup, > > + struct task_struct *p, > > + bool threadgroup) > > +{ > > + mutex_lock(&memcg_tasklist); > > + return 0; > > +} > > + > > +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss, > > + struct cgroup *cgroup, > > + struct task_struct *p, > > + bool threadgroup) > > +{ > > + mutex_unlock(&memcg_tasklist); > > +} > > + > > static void mem_cgroup_move_task(struct cgroup_subsys *ss, > > struct cgroup *cont, > > struct cgroup *old_cont, > > struct task_struct *p, > > bool threadgroup) > > { > > - mutex_lock(&memcg_tasklist); > > + mutex_unlock(&memcg_tasklist); > > What does this mean for nesting? I think the APIs are called with > cgroup_mutex held, so memcg_tasklist nests under cgroup_mutex, right? Yes. > Could you please document that at the mutex declaration point. I'm going to remove this mutex completely. It's of no use, as I said in another mail (http://marc.info/?l=linux-mm&m=125749423314702&w=2). > Shouldn't you be removing the FIXME as well? > I remove this FIXME comment in [5/8] :) This patch itself has nothing to do with this recharge feature. Thanks, Daisuke Nishimura.
[1/3] memcg: add mem_cgroup_cancel_charge() [2/3] memcg: cleanup mem_cgroup_move_parent() [3/3] memcg: remove memcg_tasklist 1 corresponds to 3 in the original patch set, and 2 to 4. There is no practical change since then. I think they are ready for merge. 3 is a substitute patch for 2 in the original patch set. I'd like an ack or some comments on this patch. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id B70D66B0062 for ; Tue, 10 Nov 2009 20:43:47 -0500 (EST) Date: Wed, 11 Nov 2009 10:36:49 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Message-Id: <20091111103649.e40e0e60.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Balbir Singh , KAMEZAWA Hiroyuki , Daisuke Nishimura , linux-mm List-ID: There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_try_charge(). This patch introduces mem_cgroup_cancel_charge() and calls it in those places. Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 38 ++++++++++++++++++-------------------- 1 files changed, 18 insertions(+), 20 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4bd3451..d92c398 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1500,6 +1500,21 @@ nomem: } /* + * Somemtimes we have to undo a charge we got by try_charge(). + * This function is for that and do uncharge, put css's refcnt. + * gotten by try_charge(). + */ +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem) +{ + if (!mem_cgroup_is_root(mem)) { + res_counter_uncharge(&mem->res, PAGE_SIZE); + if (do_swap_account) + res_counter_uncharge(&mem->memsw, PAGE_SIZE); + } + css_put(&mem->css); +} + +/* * A helper function to get mem_cgroup from ID. must be called under * rcu_read_lock(). The caller must check css_is_removed() or some if * it's concern.
(dropping refcnt from swap can be called against removed @@ -1565,12 +1580,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { unlock_page_cgroup(pc); - if (!mem_cgroup_is_root(mem)) { - res_counter_uncharge(&mem->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&mem->memsw, PAGE_SIZE); - } - css_put(&mem->css); + mem_cgroup_cancel_charge(mem); return; } @@ -1734,14 +1744,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc, cancel: put_page(page); uncharge: - /* drop extra refcnt by try_charge() */ - css_put(&parent->css); - /* uncharge if move fails */ - if (!mem_cgroup_is_root(parent)) { - res_counter_uncharge(&parent->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&parent->memsw, PAGE_SIZE); - } + mem_cgroup_cancel_charge(parent); return ret; } @@ -1957,12 +1960,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem) return; if (!mem) return; - if (!mem_cgroup_is_root(mem)) { - res_counter_uncharge(&mem->res, PAGE_SIZE); - if (do_swap_account) - res_counter_uncharge(&mem->memsw, PAGE_SIZE); - } - css_put(&mem->css); + mem_cgroup_cancel_charge(mem); } static void -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 236696B006A for ; Tue, 10 Nov 2009 20:44:28 -0500 (EST) Date: Wed, 11 Nov 2009 10:37:41 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent() Message-Id: <20091111103741.f35e9ffe.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Balbir Singh , KAMEZAWA Hiroyuki , Daisuke Nishimura , linux-mm List-ID: mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge(especially charge) is a high-cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too.
Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki --- include/linux/page_cgroup.h | 7 +--- mm/memcontrol.c | 84 ++++++++++++++++++------------------------- 2 files changed, 37 insertions(+), 54 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 4b938d4..b0e4eb1 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \ static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \ { return test_and_clear_bit(PCG_##lname, &pc->flags); } +TESTPCGFLAG(Locked, LOCK) + /* Cache flag is set only once (at allocation) */ TESTPCGFLAG(Cache, CACHE) CLEARPCGFLAG(Cache, CACHE) @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc) bit_spin_lock(PCG_LOCK, &pc->flags); } -static inline int trylock_page_cgroup(struct page_cgroup *pc) -{ - return bit_spin_trylock(PCG_LOCK, &pc->flags); -} - static inline void unlock_page_cgroup(struct page_cgroup *pc) { bit_spin_unlock(PCG_LOCK, &pc->flags); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d92c398..2f1283b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, } /** - * mem_cgroup_move_account - move account of the page + * __mem_cgroup_move_account - move account of the page * @pc: page_cgroup of the page. * @from: mem_cgroup which the page is moved from. * @to: mem_cgroup which the page is moved to. @from != @to. * * The caller must confirm following. * - page is not on LRU (isolate_page() is useful.) - * - * returns 0 at success, - * returns -EBUSY when lock is busy or "pc" is unstable. + * - the pc is locked, used, and ->mem_cgroup points to @from. * * This function does "uncharge" from old cgroup but doesn't do "charge" to * new cgroup. It should be done by a caller. */ -static int mem_cgroup_move_account(struct page_cgroup *pc, +static void __mem_cgroup_move_account(struct page_cgroup *pc, struct mem_cgroup *from, struct mem_cgroup *to) { - struct mem_cgroup_per_zone *from_mz, *to_mz; - int nid, zid; - int ret = -EBUSY; struct page *page; int cpu; struct mem_cgroup_stat *stat; @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, VM_BUG_ON(from == to); VM_BUG_ON(PageLRU(pc->page)); - - nid = page_cgroup_nid(pc); - zid = page_cgroup_zid(pc); - from_mz = mem_cgroup_zoneinfo(from, nid, zid); - to_mz = mem_cgroup_zoneinfo(to, nid, zid); - - if (!trylock_page_cgroup(pc)) - return ret; - - if (!PageCgroupUsed(pc)) - goto out; - - if (pc->mem_cgroup != from) - goto out; + VM_BUG_ON(!PageCgroupLocked(pc)); + VM_BUG_ON(!PageCgroupUsed(pc)); + VM_BUG_ON(pc->mem_cgroup != from); if (!mem_cgroup_is_root(from)) res_counter_uncharge(&from->res, PAGE_SIZE); @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, css_get(&to->css); pc->mem_cgroup = to; mem_cgroup_charge_statistics(to, pc, true); - ret = 0; -out: - unlock_page_cgroup(pc); /* * We charges against "to" which may not have any tasks. Then, "to" * can be under rmdir(). But in current implementation, caller of * this function is just force_empty() and it's garanteed that * "to" is never removed. So, we don't check rmdir status here. 
*/ +} + +/* + * check whether the @pc is valid for moving account and call + * __mem_cgroup_move_account() + */ +static int mem_cgroup_move_account(struct page_cgroup *pc, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + int ret = -EINVAL; + lock_page_cgroup(pc); + if (PageCgroupUsed(pc) && pc->mem_cgroup == from) { + __mem_cgroup_move_account(pc, from, to); + ret = 0; + } + unlock_page_cgroup(pc); return ret; } @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc, if (!pcg) return -EINVAL; + ret = -EBUSY; + if (!get_page_unless_zero(page)) + goto out; + if (isolate_lru_page(page)) + goto put; parent = mem_cgroup_from_cont(pcg); - - ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page); if (ret || !parent) - return ret; - - if (!get_page_unless_zero(page)) { - ret = -EBUSY; - goto uncharge; - } - - ret = isolate_lru_page(page); - - if (ret) - goto cancel; + goto put_back; ret = mem_cgroup_move_account(pc, child, parent); - + if (!ret) + css_put(&parent->css); /* drop extra refcnt by try_charge() */ + else + mem_cgroup_cancel_charge(parent); /* does css_put */ +put_back: putback_lru_page(page); - if (!ret) { - put_page(page); - /* drop extra refcnt by try_charge() */ - css_put(&parent->css); - return 0; - } - -cancel: +put: put_page(page); -uncharge: - mem_cgroup_cancel_charge(parent); +out: return ret; } -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id B40A16B004D for ; Tue, 10 Nov 2009 20:45:01 -0500 (EST) Date: Wed, 11 Nov 2009 10:39:06 +0900 From: Daisuke Nishimura Subject: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Message-Id: <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp> In-Reply-To: <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Balbir Singh , KAMEZAWA Hiroyuki , Daisuke Nishimura , linux-mm List-ID: memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. The cgroup_mutex, which was removed by the commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling). IIUC, the intention of this cgroup_mutex was to prevent task move during select_bad_process() so that situations like below can be avoided. Assume cgroup "foo" has exceeded its limit and is about to trigger oom. 1. Process A, which has been in cgroup "baa" and uses large memory, is just moved to cgroup "foo". Process A can be the candidates for being killed. 2. Process B, which has been in cgroup "foo" and uses large memory, is just moved from cgroup "foo". Process B can be excluded from the candidates for being killed. But these race window exists anyway even if we hold a lock, because __mem_cgroup_try_charge() decides wether it should trigger oom or not outside of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. 
And IMHO, those races are not so critical for users. This patch removes it and makes the code simpler. Signed-off-by: Daisuke Nishimura --- mm/memcontrol.c | 5 ----- 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f1283b..a74fcc2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -55,7 +55,6 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/ #define do_swap_account (0) #endif -static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */ #define SOFTLIMIT_EVENTS_THRESH (1000) /* @@ -1475,9 +1474,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, if (!nr_retries--) { if (oom) { - mutex_lock(&memcg_tasklist); mem_cgroup_out_of_memory(mem_over_limit, gfp_mask); - mutex_unlock(&memcg_tasklist); record_last_oom(mem_over_limit); } goto nomem; @@ -3385,12 +3382,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct task_struct *p, bool threadgroup) { - mutex_lock(&memcg_tasklist); /* * FIXME: It's better to move charges of this process from old * memcg to new memcg. But it's just on TODO-List now. */ - mutex_unlock(&memcg_tasklist); } struct cgroup_subsys mem_cgroup_subsys = { -- 1.5.6.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id A338B6B004D for ; Tue, 10 Nov 2009 20:52:17 -0500 (EST) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id nAB1qBPM019224 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Wed, 11 Nov 2009 10:52:12 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 5D05B45DE51 for ; Wed, 11 Nov 2009 10:52:11 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 3ACF645DE4F for ; Wed, 11 Nov 2009 10:52:11 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 2489BE1800C for ; Wed, 11 Nov 2009 10:52:11 +0900 (JST) Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.249.87.107]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id C07E6E1800A for ; Wed, 11 Nov 2009 10:52:10 +0900 (JST) Date: Wed, 11 Nov 2009 10:49:34 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Message-Id: <20091111104934.eee96fc4.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp> References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: Andrew Morton , Balbir Singh , linux-mm List-ID: On Wed, 11 Nov 2009 10:39:06 +0900 Daisuke Nishimura wrote: > memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused > by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock > problem.
The cgroup_mutex, which was removed by the commit, in > mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e > (Memory controller: OOM handling). > > IIUC, the intention of this cgroup_mutex was to prevent task move during > select_bad_process() so that situations like below can be avoided. > > Assume cgroup "foo" has exceeded its limit and is about to trigger oom. > 1. Process A, which has been in cgroup "baa" and uses large memory, is just > moved to cgroup "foo". Process A can be a candidate for being killed. > 2. Process B, which has been in cgroup "foo" and uses large memory, is just > moved from cgroup "foo". Process B can be excluded from the candidates for > being killed. > > But these race windows exist anyway even if we hold a lock, because > __mem_cgroup_try_charge() decides whether it should trigger oom or not outside > of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus > current memcg_tasklist has no use. > And IMHO, those races are not so critical > for users. > > This patch removes it and makes the code simpler. > > Signed-off-by: Daisuke Nishimura Acked-by: KAMEZAWA Hiroyuki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 8063D6B004D for ; Tue, 10 Nov 2009 23:25:50 -0500 (EST) Received: from d23relay04.au.ibm.com (d23relay04.au.ibm.com [202.81.31.246]) by e23smtp06.au.ibm.com (8.14.3/8.13.1) with ESMTP id nAB4ObXI027007 for ; Wed, 11 Nov 2009 15:24:37 +1100 Received: from d23av03.au.ibm.com (d23av03.au.ibm.com [9.190.234.97]) by d23relay04.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nAB4LgtT1662978 for ; Wed, 11 Nov 2009 15:21:42 +1100 Received: from d23av03.au.ibm.com (loopback [127.0.0.1]) by d23av03.au.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nAB4Ooex006340 for ; Wed, 11 Nov 2009 15:24:50 +1100 Date: Wed, 11 Nov 2009 09:54:48 +0530 From: Balbir Singh Subject: Re: [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Message-ID: <20091111042448.GG3314@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103649.e40e0e60.nishimura@mxp.nes.nec.co.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20091111103649.e40e0e60.nishimura@mxp.nes.nec.co.jp> Sender: owner-linux-mm@kvack.org To: Daisuke Nishimura Cc: Andrew Morton , KAMEZAWA Hiroyuki , linux-mm List-ID: * nishimura@mxp.nes.nec.co.jp [2009-11-11 10:36:49]: > There are some places calling both res_counter_uncharge() and css_put() > to cancel the charge and the refcnt we have got by mem_cgroup_try_charge(). > > This patch introduces mem_cgroup_cancel_charge() and calls it in those places. > > Signed-off-by: KAMEZAWA Hiroyuki > Signed-off-by: Daisuke Nishimura Reviewed-by: Balbir Singh -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id CB5D06B004D for ; Wed, 11 Nov 2009 09:41:00 -0500 (EST)
Received: from d28relay03.in.ibm.com (d28relay03.in.ibm.com [9.184.220.60]) by e28smtp04.in.ibm.com (8.14.3/8.13.1) with ESMTP id nABEer66016359 for ; Wed, 11 Nov 2009 20:10:53 +0530
Received: from d28av03.in.ibm.com (d28av03.in.ibm.com [9.184.220.65]) by d28relay03.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nABEeqBU3678410 for ; Wed, 11 Nov 2009 20:10:53 +0530
Received: from d28av03.in.ibm.com (loopback [127.0.0.1]) by d28av03.in.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nABEeql6009599 for ; Thu, 12 Nov 2009 01:40:52 +1100
Date: Wed, 11 Nov 2009 20:10:50 +0530
From: Balbir Singh
Subject: Re: [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent()
Message-ID: <20091111144050.GL3314@balbir.in.ibm.com>
Reply-To: balbir@linux.vnet.ibm.com
References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103741.f35e9ffe.nishimura@mxp.nes.nec.co.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <20091111103741.f35e9ffe.nishimura@mxp.nes.nec.co.jp>
Sender: owner-linux-mm@kvack.org
To: Daisuke Nishimura
Cc: Andrew Morton , KAMEZAWA Hiroyuki , linux-mm
List-ID: 

* nishimura@mxp.nes.nec.co.jp [2009-11-11 10:37:41]:

> mem_cgroup_move_parent() calls try_charge first and cancel_charge on
> failure. IMHO, charge/uncharge (especially charge) is a high-cost
> operation, so we should avoid it as much as possible.
>

But cancel_charge and move_account are not frequent operations; does
optimizing this give a significant benefit?

> This patch tries to delay try_charge in mem_cgroup_move_parent() by
> re-ordering the checks it does.
>
> And this patch renames mem_cgroup_move_account() to
> __mem_cgroup_move_account(), changes the return value of
> __mem_cgroup_move_account() from int to void, and adds a new wrapper
> (mem_cgroup_move_account()), which checks whether a @pc is valid for
> moving account and calls __mem_cgroup_move_account().
>
> This patch removes the last caller of trylock_page_cgroup(), so it removes
> its definition too.
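In rough outline, the reordered mem_cgroup_move_parent() flow that the hunks
below implement looks like the following sketch. It is illustrative only, not
the literal patched function: the helper names come from the quoted patch, but
parent is passed in as an argument here (the real function derives it from the
parent cgroup), and details are simplified.

/*
 * Illustrative sketch of the reordered flow: pin and isolate the page
 * first, charge the parent only afterwards, and roll back with
 * mem_cgroup_cancel_charge() if the move fails.
 */
static int move_parent_sketch(struct page_cgroup *pc, struct mem_cgroup *child,
                              struct mem_cgroup *parent, struct page *page,
                              gfp_t gfp_mask)
{
        int ret = -EBUSY;

        if (!get_page_unless_zero(page))        /* page may be disappearing */
                goto out;
        if (isolate_lru_page(page))             /* take it off the LRU */
                goto put;

        ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
        if (ret)
                goto put_back;

        ret = mem_cgroup_move_account(pc, child, parent);
        if (!ret)
                css_put(&parent->css);  /* drop the extra refcnt from try_charge */
        else
                mem_cgroup_cancel_charge(parent);       /* uncharge + css_put */
put_back:
        putback_lru_page(page);
put:
        put_page(page);
out:
        return ret;
}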
>
> Signed-off-by: Daisuke Nishimura
> Acked-by: KAMEZAWA Hiroyuki
> ---
>  include/linux/page_cgroup.h |    7 +---
>  mm/memcontrol.c             |   84 ++++++++++++++++++-------------------
>  2 files changed, 37 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 4b938d4..b0e4eb1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
>  { return test_and_clear_bit(PCG_##lname, &pc->flags); }
>
> +TESTPCGFLAG(Locked, LOCK)
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  CLEARPCGFLAG(Cache, CACHE)
> @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
>
> -static inline int trylock_page_cgroup(struct page_cgroup *pc)
> -{
> -	return bit_spin_trylock(PCG_LOCK, &pc->flags);
> -}
> -
>  static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d92c398..2f1283b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  }
>
>  /**
> - * mem_cgroup_move_account - move account of the page
> + * __mem_cgroup_move_account - move account of the page
>   * @pc: page_cgroup of the page.
>   * @from: mem_cgroup which the page is moved from.
>   * @to: mem_cgroup which the page is moved to. @from != @to.
>   *
>   * The caller must confirm following.
>   * - page is not on LRU (isolate_page() is useful.)
> - *
> - * returns 0 at success,
> - * returns -EBUSY when lock is busy or "pc" is unstable.
> + * - the pc is locked, used, and ->mem_cgroup points to @from.
>   *
>   * This function does "uncharge" from old cgroup but doesn't do "charge" to
>   * new cgroup. It should be done by a caller.
>   */
>
> -static int mem_cgroup_move_account(struct page_cgroup *pc,
> +static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	struct mem_cgroup *from, struct mem_cgroup *to)
>  {
> -	struct mem_cgroup_per_zone *from_mz, *to_mz;
> -	int nid, zid;
> -	int ret = -EBUSY;
>  	struct page *page;
>  	int cpu;
>  	struct mem_cgroup_stat *stat;
> @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON(PageLRU(pc->page));
> -
> -	nid = page_cgroup_nid(pc);
> -	zid = page_cgroup_zid(pc);
> -	from_mz = mem_cgroup_zoneinfo(from, nid, zid);
> -	to_mz = mem_cgroup_zoneinfo(to, nid, zid);
> -
> -	if (!trylock_page_cgroup(pc))
> -		return ret;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto out;
> -
> -	if (pc->mem_cgroup != from)
> -		goto out;
> +	VM_BUG_ON(!PageCgroupLocked(pc));
> +	VM_BUG_ON(!PageCgroupUsed(pc));
> +	VM_BUG_ON(pc->mem_cgroup != from);
>
>  	if (!mem_cgroup_is_root(from))
>  		res_counter_uncharge(&from->res, PAGE_SIZE);
> @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  	css_get(&to->css);
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> -	ret = 0;
> -out:
> -	unlock_page_cgroup(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
>  	 * this function is just force_empty() and it's garanteed that
>  	 * "to" is never removed. So, we don't check rmdir status here.
>  	 */
> +}
> +
> +/*
> + * check whether the @pc is valid for moving account and call
> + * __mem_cgroup_move_account()
> + */
> +static int mem_cgroup_move_account(struct page_cgroup *pc,
> +	struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	int ret = -EINVAL;
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> +		__mem_cgroup_move_account(pc, from, to);
> +		ret = 0;
> +	}
> +	unlock_page_cgroup(pc);
>  	return ret;
>  }
>
> @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>  	if (!pcg)
>  		return -EINVAL;
>
> +	ret = -EBUSY;
> +	if (!get_page_unless_zero(page))
> +		goto out;
> +	if (isolate_lru_page(page))
> +		goto put;
>
>  	parent = mem_cgroup_from_cont(pcg);
> -
> -
>  	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
>  	if (ret || !parent)
> -		return ret;
> -
> -	if (!get_page_unless_zero(page)) {
> -		ret = -EBUSY;
> -		goto uncharge;
> -	}
> -
> -	ret = isolate_lru_page(page);
> -
> -	if (ret)
> -		goto cancel;
> +		goto put_back;
>
>  	ret = mem_cgroup_move_account(pc, child, parent);
> -
> +	if (!ret)
> +		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
> +	else
> +		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +put_back:
>  	putback_lru_page(page);
> -	if (!ret) {
> -		put_page(page);
> -		/* drop extra refcnt by try_charge() */
> -		css_put(&parent->css);
> -		return 0;
> -	}
> -
> -cancel:
> +put:
>  	put_page(page);
> -uncharge:
> -	mem_cgroup_cancel_charge(parent);
> +out:
>  	return ret;
>  }
>

Looks good overall

--
Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 6D61D6B004D for ; Wed, 11 Nov 2009 10:16:53 -0500 (EST)
Date: Thu, 12 Nov 2009 00:16:49 +0900
From: Daisuke Nishimura
Subject: Re: [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent()
Message-Id: <20091112001649.ba228103.d-nishimura@mtf.biglobe.ne.jp>
In-Reply-To: <20091111144050.GL3314@balbir.in.ibm.com>
References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103741.f35e9ffe.nishimura@mxp.nes.nec.co.jp> <20091111144050.GL3314@balbir.in.ibm.com>
Reply-To: nishimura@mxp.nes.nec.co.jp
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: balbir@linux.vnet.ibm.com
Cc: Andrew Morton , KAMEZAWA Hiroyuki , linux-mm , d-nishimura@mtf.biglobe.ne.jp, Daisuke Nishimura
List-ID: 

On Wed, 11 Nov 2009 20:10:50 +0530 Balbir Singh wrote:
> * nishimura@mxp.nes.nec.co.jp [2009-11-11 10:37:41]:
>
> > mem_cgroup_move_parent() calls try_charge first and cancel_charge on
> > failure. IMHO, charge/uncharge (especially charge) is a high-cost
> > operation, so we should avoid it as much as possible.
> >
> But cancel_charge and move_account are not frequent operations; does
> optimizing this give a significant benefit?
>
I agree they are not called so frequently, so the benefit would not be so
big. But the number of lines in memcontrol.c decreases a bit with these
patches ;)

IMHO, the current mem_cgroup_move_parent() is a bit hard to understand, and
mem_cgroup_move_account() has redundant code. So, I cleaned them up.
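To illustrate how the new helpers compose, here is a hypothetical caller (a
sketch only, not code from this series): it charges the destination memcg up
front and rolls back with mem_cgroup_cancel_charge() if the move cannot be
done.

/*
 * Hypothetical caller, for illustration only: any path that speculatively
 * charges a destination memcg and may fail afterwards follows this
 * try_charge/move_account/cancel_charge pairing.
 */
static int recharge_one_page_sketch(struct page *page, struct page_cgroup *pc,
                                    struct mem_cgroup *from,
                                    struct mem_cgroup *to, gfp_t gfp_mask)
{
        int ret;

        ret = __mem_cgroup_try_charge(NULL, gfp_mask, &to, false, page);
        if (ret)
                return ret;     /* charge failed: nothing to roll back */

        ret = mem_cgroup_move_account(pc, from, to);
        if (ret) {
                /* pc was unstable or already moved: undo the charge */
                mem_cgroup_cancel_charge(to);
                return ret;
        }
        css_put(&to->css);      /* drop the extra reference try_charge took */
        return 0;
}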
Moreover, mem_cgroup_cancel_charge(), which I introduced in [1/3], and the
new wrapper function around mem_cgroup_move_account(), which I introduced in
this patch, will be used in my recharge-at-task-move patches and will make
them more readable.

> Looks good overall
>
Thank you.

Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 758386B004D for ; Wed, 11 Nov 2009 11:01:42 -0500 (EST)
Received: from d23relay03.au.ibm.com (d23relay03.au.ibm.com [202.81.31.245]) by e23smtp08.au.ibm.com (8.14.3/8.13.1) with ESMTP id nABG1cda019313 for ; Thu, 12 Nov 2009 03:01:38 +1100
Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96]) by d23relay03.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id nABG1b1d1781838 for ; Thu, 12 Nov 2009 03:01:38 +1100
Received: from d23av01.au.ibm.com (loopback [127.0.0.1]) by d23av01.au.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id nABG1bXO018166 for ; Thu, 12 Nov 2009 03:01:37 +1100
Date: Wed, 11 Nov 2009 21:31:34 +0530
From: Balbir Singh
Subject: Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
Message-ID: <20091111160134.GM3314@balbir.in.ibm.com>
Reply-To: balbir@linux.vnet.ibm.com
References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp>
Sender: owner-linux-mm@kvack.org
To: Daisuke Nishimura
Cc: Andrew Morton , KAMEZAWA Hiroyuki , linux-mm
List-ID: 

* nishimura@mxp.nes.nec.co.jp [2009-11-11 10:39:06]:

> memcg_tasklist was introduced at commit 7f4d454d (memcg: avoid deadlock
> caused by race between oom and cpuset_attach) instead of cgroup_mutex to
> fix a deadlock problem. The cgroup_mutex in mem_cgroup_out_of_memory(),
> which was removed by that commit, was originally introduced at commit
> c7ba5c9e (Memory controller: OOM handling).
>
> IIUC, the intention of this cgroup_mutex was to prevent task moves during
> select_bad_process() so that situations like the ones below can be avoided.
>
> Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
> 1. Process A, which has been in cgroup "baa" and uses a lot of memory, is
>    just moved to cgroup "foo". Process A can then become a candidate for
>    being killed.
> 2. Process B, which has been in cgroup "foo" and uses a lot of memory, is
>    just moved out of cgroup "foo". Process B can then be excluded from the
>    candidates for being killed.
>
> But these race windows exist anyway, even if we hold a lock, because
> __mem_cgroup_try_charge() decides whether it should trigger oom or not
> outside of the lock. So the original cgroup_mutex in
> mem_cgroup_out_of_memory(), and thus the current memcg_tasklist, has no use.
> And IMHO, those races are not so critical for users.
>
> This patch removes it and makes the code simpler.
>

Could you please test for side-effects like concurrent OOM? An idea of how
the patchset was tested would be good to have, given the implications of
these changes.

Not-Yet-Acked-by: Balbir Singh

--
Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 0D81A6B004D for ; Thu, 12 Nov 2009 03:22:21 -0500 (EST)
Date: Thu, 12 Nov 2009 17:05:10 +0900
From: Daisuke Nishimura
Subject: Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
Message-Id: <20091112170510.e635df1f.nishimura@mxp.nes.nec.co.jp>
In-Reply-To: <20091111160134.GM3314@balbir.in.ibm.com>
References: <20091106141011.3ded1551.nishimura@mxp.nes.nec.co.jp> <20091111103533.c634ff8d.nishimura@mxp.nes.nec.co.jp> <20091111103906.5c3563bb.nishimura@mxp.nes.nec.co.jp> <20091111160134.GM3314@balbir.in.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: balbir@linux.vnet.ibm.com
Cc: Andrew Morton , KAMEZAWA Hiroyuki , linux-mm , Daisuke Nishimura
List-ID: 

On Wed, 11 Nov 2009 21:31:34 +0530, Balbir Singh wrote:
> * nishimura@mxp.nes.nec.co.jp [2009-11-11 10:39:06]:
>
> > memcg_tasklist was introduced at commit 7f4d454d (memcg: avoid deadlock
> > caused by race between oom and cpuset_attach) instead of cgroup_mutex to
> > fix a deadlock problem. The cgroup_mutex in mem_cgroup_out_of_memory(),
> > which was removed by that commit, was originally introduced at commit
> > c7ba5c9e (Memory controller: OOM handling).
> >
> > IIUC, the intention of this cgroup_mutex was to prevent task moves during
> > select_bad_process() so that situations like the ones below can be
> > avoided.
> >
> > Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
> > 1. Process A, which has been in cgroup "baa" and uses a lot of memory, is
> >    just moved to cgroup "foo". Process A can then become a candidate for
> >    being killed.
> > 2. Process B, which has been in cgroup "foo" and uses a lot of memory, is
> >    just moved out of cgroup "foo". Process B can then be excluded from the
> >    candidates for being killed.
> >
> > But these race windows exist anyway, even if we hold a lock, because
> > __mem_cgroup_try_charge() decides whether it should trigger oom or not
> > outside of the lock. So the original cgroup_mutex in
> > mem_cgroup_out_of_memory(), and thus the current memcg_tasklist, has no
> > use. And IMHO, those races are not so critical for users.
> >
> > This patch removes it and makes the code simpler.
> >
>
> Could you please test for side-effects like concurrent OOM? An idea of
> how the patchset was tested would be good to have, given the
> implications of these changes.
>
Hmm, I'm not sure what your concern is.

My point is that __mem_cgroup_try_charge() decides whether it should trigger
oom or not outside of this mutex. mem_cgroup_out_of_memory() (selecting a
target task and killing it) itself would be serialized by this mutex, but the
decision whether we should trigger oom or not is made outside of it, so the
mutex has no meaning (oom will happen anyway).

Actually, I tested the following scenario.

- Make a cgroup with mem.limit == memsw.limit == 128M.
- Under the cgroup, run test programs, each of which consumes about 8MB of
  anonymous memory.
- I can run up to 15 instances of this test program, but when I start the
  16th one, oom is triggered. The number of processes killed is random (not
  necessarily one process).

This behavior doesn't change before and after this patch.

Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
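For reference, the test program described in the last message can be as small
as the following sketch (an assumption about the actual program, which was not
posted): it touches about 8MB of anonymous memory and then sleeps until it is
killed.

/* Assumed test consumer: ~8MB of anonymous memory, held until killed. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8UL * 1024 * 1024)       /* ~8MB */

int main(void)
{
        char *buf = malloc(CHUNK);

        if (!buf)
                return 1;
        memset(buf, 1, CHUNK);  /* touch every page so it is really charged */
        pause();                /* keep the memory charged until killed */
        return 0;
}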