* [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov)
@ 2009-11-19  4:27 Daisuke Nishimura
  2009-11-19  4:28 ` [PATCH -mmotm 1/5] cgroup: introduce cancel_attach() Daisuke Nishimura
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

Hi.

These are the current patches for my recharge-at-task-move feature.
They (precisely, only [5/5]) depend on KAMEZAWA-san's per-process swap usage patch,
which is not merged yet, so they are not for inclusion yet. I post them just for review
and to share my current code.

In the current memcg, charges associated with a task aren't moved to the new cgroup
when the task moves. Some users find this behavior strange.
These patches implement that feature, that is, recharging to
the new cgroup and, of course, uncharging from the old cgroup at task move.

The current version supports only recharge of non-shared (mapcount == 1) anonymous pages
and the swap entries of those pages. I think that is enough as a first step.

  [1/5] cgroup: introduce cancel_attach()
  [2/5] memcg: add interface to recharge at task move
  [3/5] memcg: recharge charges of anonymous page
  [4/5] memcg: avoid oom during recharge at task move
  [5/5] memcg: recharge charges of anonymous swap

Overall history of this patch set:
2009/11/19
- rebase on mmotm-2009-11-17-14-03 + KAMEZAWA-san's show per-process swap usage
  via procfs patch(v3).
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(anon_rss, swap_usage).
- handle recharge_at_immigrate as bitmask(as I did in first version)
- use mm->owner instead of thread_group_leader()
2009/11/06
- remove "[RFC]".
- rebase on mmotm-2009-11-01-10-01.
- drop support for file cache and shmem/tmpfs(revisit in future).
- update Documentation/cgroup/memory.txt.
2009/10/13
- rebase on mmotm-2009-10-09-01-07 + KAMEZAWA-san's batched charge/uncharge(Oct09) + part
of KAMEZAWA-san's cleanup/fix patches(4,5,7 of Sep25 with some fixes).
- change the term "migrate" to "recharge".
2009/09/24
- change "migrate_charge" flag from "int" to "bool".
- in can_attach(), parse the page table of the task and count only the number
  of target ptes and call try_charge() repeatedly. No isolation at this phase.
- in attach(), parse the page table of the task again, and isolate the target
  page and call move_account() one by one.
- do no swap-in in moving swap account any more.
- add support for shmem/tmpfs's swap.
- update Documentation/cgroup/cgroup.txt.
2009/09/17
- first version

TODO:
- add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages.
- implement madvise(2) to let users decide the target vma for recharge.

Any comments or suggestions would be welcome.


Thanks,
Daisuke Nishimura.


* [PATCH -mmotm 1/5] cgroup: introduce cancel_attach()
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
@ 2009-11-19  4:28 ` Daisuke Nishimura
  2009-11-19 21:42   ` Paul Menage
  2009-11-19  4:29 ` [PATCH -mmotm 2/5] memcg: add interface to recharge at task move Daisuke Nishimura
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch adds a cancel_attach() operation to struct cgroup_subsys.
cancel_attach() can be used when the can_attach() operation prepares something
for the subsystem, and we need to roll back what can_attach() has prepared
if attaching the task fails after can_attach() has already succeeded.
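
As an illustration of the intended pairing (a hedged sketch, not part of this
patch; the subsystem and its foo_reserve()/foo_unreserve() helpers are
hypothetical), a subsystem with side effects in can_attach() would look
roughly like this:

static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			  struct task_struct *tsk, bool threadgroup)
{
	/* foo_reserve() is a made-up helper that pins some resource */
	return foo_reserve(cgrp, tsk);
}

static void foo_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *tsk, bool threadgroup)
{
	/*
	 * Undo whatever foo_can_attach() prepared, because the attach was
	 * aborted after can_attach() had already succeeded.
	 */
	foo_unreserve(cgrp, tsk);
}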

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>

Changelog: 2009/11/19
- fix typo and coding style.
Changelog: 2009/09/24
- add explanation about cancel_attach() to Documentation/cgroup/cgroup.txt.
---
 Documentation/cgroups/cgroups.txt |   13 +++++++++++-
 include/linux/cgroup.h            |    2 +
 kernel/cgroup.c                   |   39 ++++++++++++++++++++++++++++++------
 3 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe..d450826 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
 
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+	       struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has side effects should provide this
+function, so that the subsystem can roll back; otherwise it is not necessary.
+This will be called only for subsystems whose can_attach() operation has
+succeeded.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup *old_cgrp, struct task_struct *task,
 	    bool threadgroup)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0008dee..d4cc200 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -427,6 +427,8 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk, bool threadgroup);
+	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *tsk,
 			bool threadgroup);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0249f4b..826d948 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	int retval = 0;
-	struct cgroup_subsys *ss;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
 	struct css_set *cg;
 	struct css_set *newcg;
@@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
 			retval = ss->can_attach(ss, cgrp, tsk, false);
-			if (retval)
-				return retval;
+			if (retval) {
+				/*
+				 * Remember at which subsystem we've failed in
+				 * can_attach() to call cancel_attach() only
+				 * against subsystems whose attach() have
+				 * succeeded(see below).
+				 */
+				failed_ss = ss;
+				goto out;
+			}
 		}
 	}
 
@@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 */
 	newcg = find_css_set(cg, cgrp);
 	put_css_set(cg);
-	if (!newcg)
-		return -ENOMEM;
+	if (!newcg) {
+		retval = -ENOMEM;
+		goto out;
+	}
 
 	task_lock(tsk);
 	if (tsk->flags & PF_EXITING) {
 		task_unlock(tsk);
 		put_css_set(newcg);
-		return -ESRCH;
+		retval = -ESRCH;
+		goto out;
 	}
 	rcu_assign_pointer(tsk->cgroups, newcg);
 	task_unlock(tsk);
@@ -1601,7 +1612,21 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 * is no longer empty.
 	 */
 	cgroup_wakeup_rmdir_waiter(cgrp);
-	return 0;
+out:
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				/*
+				 * This means can_attach() of this subsystem
+				 * have failed, so we don't need to call
+				 * cancel_attach() against rests of subsystems.
+				 */
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, tsk, false);
+		}
+	}
+	return retval;
 }
 
 /*
-- 
1.5.6.1


* [PATCH -mmotm 2/5] memcg: add interface to recharge at task move
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
  2009-11-19  4:28 ` [PATCH -mmotm 1/5] cgroup: introduce cancel_attach() Daisuke Nishimura
@ 2009-11-19  4:29 ` Daisuke Nishimura
  2009-11-20 15:42   ` Balbir Singh
  2009-11-19  4:29 ` [PATCH -mmotm 3/5] memcg: recharge charges of anonymous page Daisuke Nishimura
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:29 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

In the current memcg, charges associated with a task aren't moved to the new cgroup
when the task moves. Some users find this behavior strange.
These patches implement that feature, that is, recharging to the new cgroup
and, of course, uncharging from the old cgroup at task move.

This patch adds the "memory.recharge_at_immigrate" file, a flag file that
determines whether charges should be moved to the new cgroup at task move
and what type of charges should be recharged. This patch also adds read
and write handlers for the file.

This patch also adds no-op handlers for this feature; they will be
implemented in later patches. For now, you cannot write any value other
than 0 to recharge_at_immigrate.
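
For illustration, this is roughly how one bit of the mask is meant to be
consumed in-kernel once the handlers are filled in (a hedged sketch, not part
of this patch; the helper name is made up and RECHARGE_TYPE_ANON itself is
only introduced later in this series):

static bool recharge_anon_enabled(struct mem_cgroup *mem)
{
	/* bit 0 of recharge_at_immigrate selects anonymous pages (and swap) */
	return test_bit(RECHARGE_TYPE_ANON, &mem->recharge_at_immigrate);
}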

Changelog: 2009/11/19
- consolidate changes in Documentation/cgroup/memory.txt, which were made in
  other patches separately.
- handle recharge_at_immigrate as bitmask(as I did in first version).
- use mm->owner instead of thread_group_leader().
Changelog: 2009/09/24
- change the term "migration" to "recharge".
- handle the flag as bool not bitmask to make codes simple.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/cgroups/memory.txt |   42 ++++++++++++++++-
 mm/memcontrol.c                  |   93 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..809585e 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
 4.2 Task migration
 
 When a task migrates from one cgroup to another, it's charge is not
-carried forward. The pages allocated from the original cgroup still
+carried forward by default. The pages allocated from the original cgroup still
 remain charged to it, the charge is dropped when the page is freed or
 reclaimed.
 
+Note: You can move charges of a task along with task migration. See 8.
+
 4.3 Removing a cgroup
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
@@ -414,7 +416,43 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. TODO
+8. Recharge at task move
+
+Users can move charges associated with a task along with task move, that is,
+uncharge from the old cgroup and charge to the new cgroup.
+
+8.1 Interface
+
+This feature is disabled by default. It can be enabled(and disabled again) by
+writing to memory.recharge_at_immigrate of the destination cgroup.
+
+If you want to enable it:
+
+# echo (some positive value) > memory.recharge_at_immigrate
+
+Note: Each bits of recharge_at_immigrate has its own meaning about what type of
+charges should be recharged. See 8.2 for details.
+
+And if you want disable it again:
+
+# echo 0 > memory.recharge_at_immigrate
+
+8.2 Type of charges which can be recharged
+
+Each bits of recharge_at_immigrate has its own meaning about what type of
+charges should be recharged.
+
+  bit | what type of charges would be recharged ?
+ -----+------------------------------------------------------------------------
+   0  | A charge of an anonymous page(or swap of it) used by the target task.
+      | Those pages and swaps must be used only by the target task. You must
+      | enable Swap Extension(see 2.4) to enable recharge of swap.
+
+Note: Those pages and swaps must be charged to the old cgroup.
+Note: More type of pages(e.g. file cache, shmem,) will be supported by other
+bits in future.
+
+9. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc16f08..13fe93d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -226,11 +226,23 @@ struct mem_cgroup {
 	bool		memsw_is_minimum;
 
 	/*
+	 * Should we recharge charges of a task when a task is moved into this
+	 * mem_cgroup ? And what type of charges should we recharge ?
+	 */
+	unsigned long 	recharge_at_immigrate;
+
+	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
 
+/* Stuffs for recharge at task move. */
+/* Types of charges to be recharged */
+enum recharge_type {
+	NR_RECHARGE_TYPE,
+};
+
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
@@ -2860,6 +2872,31 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	return 0;
 }
 
+static u64 mem_cgroup_recharge_read(struct cgroup *cgrp,
+					struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cgrp)->recharge_at_immigrate;
+}
+
+static int mem_cgroup_recharge_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	if (val >= (1 << NR_RECHARGE_TYPE))
+		return -EINVAL;
+	/*
+	 * We check this value several times in both in can_attach() and
+	 * attach(), so we need cgroup lock to prevent this value from being
+	 * inconsistent.
+	 */
+	cgroup_lock();
+	mem->recharge_at_immigrate = val;
+	cgroup_unlock();
+
+	return 0;
+}
+
 
 /* For read statistics */
 enum {
@@ -3093,6 +3130,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+	{
+		.name = "recharge_at_immigrate",
+		.read_u64 = mem_cgroup_recharge_read,
+		.write_u64 = mem_cgroup_recharge_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -3340,6 +3382,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
+	mem->recharge_at_immigrate = 0;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -3376,16 +3419,56 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 	return ret;
 }
 
+/* Handlers for recharge at task move. */
+static int mem_cgroup_can_recharge(void)
+{
+	return 0;
+}
+
+static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+	int ret = 0;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
+	if (mem->recharge_at_immigrate) {
+		struct mm_struct *mm;
+		struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+		VM_BUG_ON(from == mem);
+
+		mm = get_task_mm(p);
+		if (!mm)
+			return 0;
+
+		if (mm->owner == p)
+			ret = mem_cgroup_can_recharge();
+
+		mmput(mm);
+	}
+	return ret;
+}
+
+static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+}
+
+static void mem_cgroup_recharge(void)
+{
+}
+
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	/*
-	 * FIXME: It's better to move charges of this process from old
-	 * memcg to new memcg. But it's just on TODO-List now.
-	 */
+	mem_cgroup_recharge();
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
@@ -3395,6 +3478,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
 	.pre_destroy = mem_cgroup_pre_destroy,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
+	.can_attach = mem_cgroup_can_attach,
+	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 	.use_id = 1,
-- 
1.5.6.1


* [PATCH -mmotm 3/5] memcg: recharge charges of anonymous page
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
  2009-11-19  4:28 ` [PATCH -mmotm 1/5] cgroup: introduce cancel_attach() Daisuke Nishimura
  2009-11-19  4:29 ` [PATCH -mmotm 2/5] memcg: add interface to recharge at task move Daisuke Nishimura
@ 2009-11-19  4:29 ` Daisuke Nishimura
  2009-11-19  4:30 ` [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move Daisuke Nishimura
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:29 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch is the core part of this recharge-at-task-move feature.
It implements functions to recharge charges of anonymous pages mapped only by
the target task.

Implementation:
- Define struct recharge_struct and a variable of it (recharge) to remember
  the count of pre-charges and other information.
- At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge()
  repeatedly and count up recharge.precharge.
- At attach(), parse the page table, find a target page to be recharged, and
  call mem_cgroup_move_account() for the page.
- Cancel all remaining pre-charges (recharge.precharge > 0) on failure or at
  the end of the task move.
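
A condensed sketch of the can_attach() side (the helper name is made up and
error handling is simplified; the real code is mem_cgroup_prepare_recharge()/
mem_cgroup_do_precharge() in the diff below, and the attach() side then
consumes one pre-charge per page it moves with mem_cgroup_move_account()):

static int precharge_sketch(struct mm_struct *mm)
{
	unsigned long need = get_mm_counter(mm, anon_rss);
	struct mem_cgroup *mem = recharge.to;

	/* reserve one charge on "to" per anonymous page we may move */
	while (need--) {
		if (__mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL))
			return -ENOMEM;	/* caller cancels the pre-charges */
		recharge.precharge++;
	}
	return 0;
}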

Changelog: 2009/11/19
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(anon_rss).
- loosen the valid check in is_target_pte_for_recharge().
Changelog: 2009/11/06
- drop support for file cache, shmem/tmpfs and shared(used by multiple processes)
  pages(revisit in future).
Changelog: 2009/10/13
- change the term "migrate" to "recharge".
Changelog: 2009/09/24
- in can_attach(), parse the page table of the task and count only the number
  of target ptes and call try_charge() repeatedly. No isolation at this phase.
- in attach(), parse the page table of the task again, and isolate the target
  page and call move_account() one by one.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |  241 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 234 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13fe93d..df363da 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -21,6 +21,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -240,8 +241,17 @@ struct mem_cgroup {
 /* Stuffs for recharge at task move. */
 /* Types of charges to be recharged */
 enum recharge_type {
+	RECHARGE_TYPE_ANON,	/* private anonymous page and swap of it */
 	NR_RECHARGE_TYPE,
 };
+/* "recharge" and its members are protected by cgroup_lock */
+struct recharge_struct {
+	struct mem_cgroup *from;
+	struct mem_cgroup *to;
+	unsigned long precharge;
+};
+static struct recharge_struct recharge;
+
 
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
@@ -1499,7 +1509,7 @@ charged:
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
-	if (mem_cgroup_soft_limit_check(mem))
+	if (page && mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 done:
 	return 0;
@@ -3420,9 +3430,50 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 }
 
 /* Handlers for recharge at task move. */
-static int mem_cgroup_can_recharge(void)
+static int mem_cgroup_do_precharge(void)
 {
-	return 0;
+	int ret = -ENOMEM;
+	struct mem_cgroup *mem = recharge.to;
+
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
+	if (ret || !mem)
+		return -ENOMEM;
+
+	recharge.precharge++;
+	return ret;
+}
+
+#define PRECHARGE_AT_ONCE	256
+static int mem_cgroup_prepare_recharge(struct mm_struct *mm)
+{
+	int ret = 0;
+	int count = PRECHARGE_AT_ONCE;
+	unsigned long prepare = 0;
+	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
+					&recharge.to->recharge_at_immigrate);
+
+	if (recharge_anon)
+		prepare += get_mm_counter(mm, anon_rss);
+
+	while (!ret && prepare--) {
+		if (!count--) {
+			count = PRECHARGE_AT_ONCE;
+			cond_resched();
+		}
+		ret = mem_cgroup_do_precharge();
+	}
+
+	return ret;
+}
+
+static void mem_cgroup_clear_recharge(void)
+{
+	while (recharge.precharge) {
+		mem_cgroup_cancel_charge(recharge.to);
+		recharge.precharge--;
+	}
+	recharge.from = NULL;
+	recharge.to = NULL;
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -3443,8 +3494,18 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 		if (!mm)
 			return 0;
 
-		if (mm->owner == p)
-			ret = mem_cgroup_can_recharge();
+		if (mm->owner == p) {
+			VM_BUG_ON(recharge.from);
+			VM_BUG_ON(recharge.to);
+			VM_BUG_ON(recharge.precharge);
+			recharge.from = from;
+			recharge.to = mem;
+			recharge.precharge = 0;
+
+			ret = mem_cgroup_prepare_recharge(mm);
+			if (ret)
+				mem_cgroup_clear_recharge();
+		}
 
 		mmput(mm);
 	}
@@ -3456,10 +3517,165 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
+	mem_cgroup_clear_recharge();
 }
 
-static void mem_cgroup_recharge(void)
+/**
+ * is_target_pte_for_recharge - check a pte whether it is valid for recharge
+ * @vma: the vma the pte to be checked belongs
+ * @addr: the address corresponding to the pte to be checked
+ * @ptent: the pte to be checked
+ * @target: the pointer the target page will be stored
+ *
+ * Returns
+ *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
+ *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
+ *     for recharge. if @target is not NULL, the page is stored in target->page
+ *     with extra refcnt got(Callers should handle it).
+ *
+ * Called with pte lock held.
+ */
+/* We add a new member later. */
+union recharge_target {
+	struct page	*page;
+};
+
+/* We add a new type later. */
+enum recharge_target_type {
+	RECHARGE_TARGET_NONE,	/* not used */
+	RECHARGE_TARGET_PAGE,
+};
+
+static int is_target_pte_for_recharge(struct vm_area_struct *vma,
+		unsigned long addr, pte_t ptent, union recharge_target *target)
 {
+	struct page *page;
+	struct page_cgroup *pc;
+	int ret = 0;
+	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
+					&recharge.to->recharge_at_immigrate);
+
+	if (!pte_present(ptent))
+		return 0;
+
+	page = vm_normal_page(vma, addr, ptent);
+	if (!page || !page_mapped(page))
+		return 0;
+	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
+	if (!recharge_anon || !PageAnon(page))
+		return 0;
+	/*
+	 * TODO: We don't recharge shared(used by multiple processes) pages
+	 * for now.
+	 */
+	if (page_mapcount(page) > 1)
+		return 0;
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	pc = lookup_page_cgroup(page);
+	/*
+	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
+	 * checks the pc is valid or not under the lock.
+	 */
+	if (PageCgroupUsed(pc)) {
+		ret = RECHARGE_TARGET_PAGE;
+		target->page = page;
+	}
+
+	if (!ret)
+		put_page(page);
+
+	return ret;
+}
+
+static int mem_cgroup_recharge_pte_range(pmd_t *pmd,
+				unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
+{
+	int ret = 0;
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+retry:
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t ptent = *(pte++);
+		union recharge_target target;
+		int type;
+		struct page *page;
+		struct page_cgroup *pc;
+
+		if (!recharge.precharge)
+			break;
+
+		type = is_target_pte_for_recharge(vma, addr, ptent, &target);
+		switch (type) {
+		case RECHARGE_TARGET_PAGE:
+			page = target.page;
+			if (isolate_lru_page(page))
+				goto put;
+			pc = lookup_page_cgroup(page);
+			if (!mem_cgroup_move_account(pc,
+						recharge.from, recharge.to)) {
+				css_put(&recharge.to->css);
+				recharge.precharge--;
+			}
+			putback_lru_page(page);
+put:			/* is_target_pte_for_recharge() gets the page */
+			put_page(page);
+			break;
+		default:
+			break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+
+	if (addr != end) {
+		/*
+		 * We have consumed all precharges we got in can_attach().
+		 * We try precharge one by one, but don't do any additional
+		 * precharges nor recharges to recharge.to if we have failed in
+		 * precharge once in attach() phase.
+		 */
+		ret = mem_cgroup_do_precharge();
+		if (!ret)
+			goto retry;
+	}
+
+	return ret;
+}
+
+static void mem_cgroup_recharge(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	lru_add_drain_all();
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		int ret;
+		struct mm_walk mem_cgroup_recharge_walk = {
+			.pmd_entry = mem_cgroup_recharge_pte_range,
+			.mm = mm,
+			.private = vma,
+		};
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		/* TODO: We don't recharge shmem/tmpfs pages for now. */
+		if (vma->vm_flags & VM_SHARED)
+			continue;
+		ret = walk_page_range(vma->vm_start, vma->vm_end,
+						&mem_cgroup_recharge_walk);
+		if (ret)
+			/*
+			 * means we have consumed all precharges and failed in
+			 * doing additional precharge. Just abandon here.
+			 */
+			break;
+		cond_resched();
+	}
+	up_read(&mm->mmap_sem);
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
@@ -3468,7 +3684,18 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	mem_cgroup_recharge();
+	struct mm_struct *mm;
+
+	if (!recharge.to)
+		/* no need to recharge */
+		return;
+
+	mm = get_task_mm(p);
+	if (mm) {
+		mem_cgroup_recharge(mm);
+		mmput(mm);
+	}
+	mem_cgroup_clear_recharge();
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
-- 
1.5.6.1


* [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
                   ` (2 preceding siblings ...)
  2009-11-19  4:29 ` [PATCH -mmotm 3/5] memcg: recharge charges of anonymous page Daisuke Nishimura
@ 2009-11-19  4:30 ` Daisuke Nishimura
  2009-11-23  5:10   ` Balbir Singh
  2009-11-19  4:31 ` [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap Daisuke Nishimura
  2009-11-19 19:03 ` [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Balbir Singh
  5 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This recharge-at-task-move feature holds extra charges (pre-charges) on the "to"
mem_cgroup while recharging is in progress. This means an unnecessary OOM can
happen in that group.

This patch tries to avoid such an OOM.
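
The idea, condensed from the __mem_cgroup_try_charge() hunk below into a
hypothetical helper (the real code additionally takes rcu_read_lock() and uses
css_tryget() to guard against a racing rmdir): if the task hitting the limit
is not the one performing the move and the over-limit group is the recharge
destination (or related to it in the hierarchy), keep retrying reclaim instead
of invoking the OOM killer, because the pressure may come only from transient
pre-charges.

static bool should_retry_for_recharge(struct mem_cgroup *mem_over_limit)
{
	struct mem_cgroup *dest = recharge.to;

	if (!recharge.working || current == recharge.working || !dest)
		return false;
	if (dest->use_hierarchy)
		return css_is_ancestor(&dest->css, &mem_over_limit->css);
	return dest == mem_over_limit;
}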

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df363da..3a07383 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -249,6 +249,7 @@ struct recharge_struct {
 	struct mem_cgroup *from;
 	struct mem_cgroup *to;
 	unsigned long precharge;
+	struct task_struct *working;	/* a task moving the target task */
 };
 static struct recharge_struct recharge;
 
@@ -1494,6 +1495,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (mem_cgroup_check_under_limit(mem_over_limit))
 			continue;
 
+		/* try to avoid oom while someone is recharging */
+		if (recharge.working && current != recharge.working) {
+			struct mem_cgroup *dest;
+			bool do_continue = false;
+			/*
+			 * There is a small race that "dest" can be freed by
+			 * rmdir, so we use css_tryget().
+			 */
+			rcu_read_lock();
+			dest = recharge.to;
+			if (dest && css_tryget(&dest->css)) {
+				if (dest->use_hierarchy)
+					do_continue = css_is_ancestor(
+							&dest->css,
+							&mem_over_limit->css);
+				else
+					do_continue = (dest == mem_over_limit);
+				css_put(&dest->css);
+			}
+			rcu_read_unlock();
+			if (do_continue)
+				continue;
+		}
+
 		if (!nr_retries--) {
 			if (oom) {
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
@@ -3474,6 +3499,7 @@ static void mem_cgroup_clear_recharge(void)
 	}
 	recharge.from = NULL;
 	recharge.to = NULL;
+	recharge.working = NULL;
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -3498,9 +3524,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			VM_BUG_ON(recharge.from);
 			VM_BUG_ON(recharge.to);
 			VM_BUG_ON(recharge.precharge);
+			VM_BUG_ON(recharge.working);
 			recharge.from = from;
 			recharge.to = mem;
 			recharge.precharge = 0;
+			recharge.working = current;
 
 			ret = mem_cgroup_prepare_recharge(mm);
 			if (ret)
-- 
1.5.6.1


* [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
                   ` (3 preceding siblings ...)
  2009-11-19  4:30 ` [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move Daisuke Nishimura
@ 2009-11-19  4:31 ` Daisuke Nishimura
  2009-11-23  6:59   ` Balbir Singh
  2009-11-19 19:03 ` [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Balbir Singh
  5 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19  4:31 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch is another core part of the recharge-at-task-move feature.
It enables recharge of anonymous swap entries.

To move the charge of a swap entry, we need to exchange the swap_cgroup's record.

In the current implementation, the swap_cgroup's record is protected by:

  - page lock: if the entry is on the swap cache.
  - swap_lock: if the entry is not on the swap cache.

This works well for usual swap-in/out activity.

But it forces the charge migration of swap to check many conditions in order
to exchange the swap_cgroup's record safely.

So I changed the modification of the swap_cgroup's record (swap_cgroup_record())
to use xchg, and defined a new function to cmpxchg the swap_cgroup's record.

This patch also enables recharge of swap caches that are not pte_present but
not yet uncharged, which can exist on the swap-out path, by looking up the
target pages via find_get_page() as do_mincore() does.
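
A simplified sketch of the record-move protocol (the function name is made up;
the real code is mem_cgroup_move_swap_account() in the diff below, which on
success also fixes up the memsw res_counters, swap statistics and css
references of both groups):

static int move_swap_record_sketch(swp_entry_t ent, struct mem_cgroup *from,
				   struct mem_cgroup *to)
{
	unsigned short old = css_id(&from->css);
	unsigned short new = css_id(&to->css);

	/*
	 * Only succeed if the record still belongs to "from"; a concurrent
	 * swap-in or uncharge may have changed or cleared it meanwhile.
	 */
	if (swap_cgroup_cmpxchg(ent, old, new) != old)
		return -EINVAL;
	return 0;
}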

Changelog: 2009/11/19
- in can_attach(), instead of parsing the page table, make use of per process
  mm_counter(swap_usage).
Changelog: 2009/11/06
- drop support for shmem's swap(revisit in future).
- add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
  multiple processes(revisit in future).
Changelog: 2009/09/24
- do no swap-in in moving swap account any more.
- add support for shmem's swap.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |    2 +
 include/linux/swap.h        |    1 +
 mm/memcontrol.c             |  150 ++++++++++++++++++++++++++++++++++---------
 mm/page_cgroup.c            |   35 ++++++++++-
 mm/swapfile.c               |   31 +++++++++
 5 files changed, 186 insertions(+), 33 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b0e4eb1..30b0813 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
 #include <linux/swap.h>
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9f0ca32..2a3209e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
+extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a07383..ea00a93 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -33,6 +33,7 @@
 #include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
@@ -2230,6 +2231,49 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 	}
 	rcu_read_unlock();
 }
+
+/**
+ * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
+ * @entry: swap entry to be moved
+ * @from:  mem_cgroup which the entry is moved from
+ * @to:  mem_cgroup which the entry is moved to
+ *
+ * It successes only when the swap_cgroup's record for this entry is the same
+ * as the mem_cgroup's id of @from.
+ *
+ * Returns 0 on success, -EINVAL on failure.
+ *
+ * The caller must have called __mem_cgroup_try_charge on @to.
+ */
+static int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	unsigned short old_id, new_id;
+
+	old_id = css_id(&from->css);
+	new_id = css_id(&to->css);
+
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
+		if (!mem_cgroup_is_root(from))
+			res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(from, false);
+		mem_cgroup_put(from);
+
+		if (!mem_cgroup_is_root(to))
+			res_counter_uncharge(&to->res, PAGE_SIZE);
+		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_get(to);
+
+		return 0;
+	}
+	return -EINVAL;
+}
+#else
+static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	return -EINVAL;
+}
 #endif
 
 /*
@@ -3477,8 +3521,10 @@ static int mem_cgroup_prepare_recharge(struct mm_struct *mm)
 	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
 					&recharge.to->recharge_at_immigrate);
 
-	if (recharge_anon)
+	if (recharge_anon) {
 		prepare += get_mm_counter(mm, anon_rss);
+		prepare += get_mm_counter(mm, swap_usage);
+	}
 
 	while (!ret && prepare--) {
 		if (!count--) {
@@ -3553,66 +3599,99 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
  * @vma: the vma the pte to be checked belongs
  * @addr: the address corresponding to the pte to be checked
  * @ptent: the pte to be checked
- * @target: the pointer the target page will be stored
+ * @target: the pointer the target page or swap entry will be stored
  *
  * Returns
  *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
  *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
  *     for recharge. if @target is not NULL, the page is stored in target->page
  *     with extra refcnt got(Callers should handle it).
+ *   2(MIGRATION_TARGET_SWAP): if the swap entry corresponding to this pte is a
+ *     target for charge migration. if @target is not NULL, the entry is stored
+ *     in target->ent.
  *
  * Called with pte lock held.
  */
-/* We add a new member later. */
 union recharge_target {
 	struct page	*page;
+	swp_entry_t	ent;
 };
 
-/* We add a new type later. */
 enum recharge_target_type {
 	RECHARGE_TARGET_NONE,	/* not used */
 	RECHARGE_TARGET_PAGE,
+	RECHARGE_TARGET_SWAP,
 };
 
 static int is_target_pte_for_recharge(struct vm_area_struct *vma,
 		unsigned long addr, pte_t ptent, union recharge_target *target)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct page_cgroup *pc;
+	swp_entry_t ent = { .val = 0 };
+	int user = 0;
 	int ret = 0;
 	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
 					&recharge.to->recharge_at_immigrate);
 
-	if (!pte_present(ptent))
-		return 0;
+	if (!pte_present(ptent)) {
+		/* TODO: handle swap of shmes/tmpfs */
+		if (pte_none(ptent) || pte_file(ptent))
+			return 0;
+		else if (is_swap_pte(ptent)) {
+			ent = pte_to_swp_entry(ptent);
+			if (!recharge_anon || non_swap_entry(ent))
+				return 0;
+			user = mem_cgroup_count_swap_user(ent, &page);
+		}
+	} else {
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page || !page_mapped(page))
+			return 0;
+		/*
+		 * TODO: We don't recharge file(including shmem/tmpfs) pages
+		 * for now.
+		 */
+		if (!recharge_anon || !PageAnon(page))
+			return 0;
+		if (!get_page_unless_zero(page))
+			return 0;
+		user = page_mapcount(page);
+	}
 
-	page = vm_normal_page(vma, addr, ptent);
-	if (!page || !page_mapped(page))
-		return 0;
-	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
-	if (!recharge_anon || !PageAnon(page))
-		return 0;
-	/*
-	 * TODO: We don't recharge shared(used by multiple processes) pages
-	 * for now.
-	 */
-	if (page_mapcount(page) > 1)
-		return 0;
-	if (!get_page_unless_zero(page))
+	if (user > 1) {
+		/*
+		 * TODO: We don't recharge shared(used by multiple processes)
+		 * pages for now.
+		 */
+		if (page)
+			put_page(page);
 		return 0;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
-	 * checks the pc is valid or not under the lock.
-	 */
-	if (PageCgroupUsed(pc)) {
-		ret = RECHARGE_TARGET_PAGE;
-		target->page = page;
 	}
 
-	if (!ret)
-		put_page(page);
+	if (page) {
+		pc = lookup_page_cgroup(page);
+		/*
+		 * Do only loose check w/o page_cgroup lock.
+		 * mem_cgroup_move_account() checks the pc is valid or not under
+		 * the lock.
+		 */
+		if (PageCgroupUsed(pc)) {
+			ret = RECHARGE_TARGET_PAGE;
+			target->page = page;
+		}
+		if (!ret)
+			put_page(page);
+	}
+	/* fall throught */
+	if (ent.val && do_swap_account && !ret) {
+		/*
+		 * mem_cgroup_move_swap_account() checks the entry is valid or
+		 * not.
+		 */
+		ret = RECHARGE_TARGET_SWAP;
+		target->ent = ent;
+	}
 
 	return ret;
 }
@@ -3634,6 +3713,7 @@ retry:
 		int type;
 		struct page *page;
 		struct page_cgroup *pc;
+		swp_entry_t ent;
 
 		if (!recharge.precharge)
 			break;
@@ -3654,6 +3734,14 @@ retry:
 put:			/* is_target_pte_for_recharge() gets the page */
 			put_page(page);
 			break;
+		case RECHARGE_TARGET_SWAP:
+			ent = target.ent;
+			if (!mem_cgroup_move_swap_account(ent,
+						recharge.from, recharge.to)) {
+				css_put(&recharge.to->css);
+				recharge.precharge--;
+			}
+			break;
 		default:
 			break;
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 3d535d5..213b0ee 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <asm/cmpxchg.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -335,6 +336,37 @@ not_enough_page:
 }
 
 /**
+ * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
+ * @end: swap entry to be cmpxchged
+ * @old: old id
+ * @new: new id
+ *
+ * Returns old id at success, 0 at failure.
+ * (There is no mem_cgroup useing 0 as its id)
+ */
+unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new)
+{
+	int type = swp_type(ent);
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+	sc = page_address(mappage);
+	sc += pos;
+	if (cmpxchg(&sc->id, old, new) == old)
+		return old;
+	else
+		return 0;
+}
+
+/**
  * swap_cgroup_record - record mem_cgroup for this swp_entry.
  * @ent: swap entry to be recorded into
  * @mem: mem_cgroup to be recorded
@@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 	mappage = ctrl->map[idx];
 	sc = page_address(mappage);
 	sc += pos;
-	old = sc->id;
-	sc->id = id;
+	old = xchg(&sc->id, id);
 
 	return old;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f32d716..cc2e859 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -719,6 +719,37 @@ int free_swap_and_cache(swp_entry_t entry)
 	return p != NULL;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/**
+ * mem_cgroup_count_swap_user - count the user of a swap entry
+ * @ent: the swap entry to be checked
+ * @pagep: the pointer for the swap cache page of the entry to be stored
+ *
+ * Returns the number of the user of the swap entry. The number is valid only
+ * for swaps of anonymous pages.
+ * If the entry is found on swap cache, the page is stored to pagep with
+ * refcount of it being incremented.
+ */
+int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep)
+{
+	struct page *page;
+	struct swap_info_struct *p;
+	int count = 0;
+
+	page = find_get_page(&swapper_space, ent.val);
+	if (page)
+		count += page_mapcount(page);
+	p = swap_info_get(ent);
+	if (p) {
+		count += swap_count(p->swap_map[swp_offset(ent)]);
+		spin_unlock(&swap_lock);
+	}
+
+	*pagep = page;
+	return count;
+}
+#endif
+
 #ifdef CONFIG_HIBERNATION
 /*
  * Find the swap type that corresponds to given device (if any).
-- 
1.5.6.1


* Re: [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov)
  2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
                   ` (4 preceding siblings ...)
  2009-11-19  4:31 ` [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap Daisuke Nishimura
@ 2009-11-19 19:03 ` Balbir Singh
  5 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2009-11-19 19:03 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:27:34]:

> Hi.
> 
> These are current patches of my recharge-at-task-move feature.
> They(precisely, only [5/5]) are dependent on KAMEZAWA-san's per-process swap usage patch,
> which is not merged yet, so are not for inclusion yet. I post them just for review and
> to share my current code.
> 
> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task move. Some users feel this behavior to be strange.
> These patches are for this feature, that is, for recharging to
> the new cgroup and, of course, uncharging from old cgroup at task move.
> 
> Current version supports only recharge of non-shared(mapcount == 1) anonymous pages
> and swaps of those pages. I think it's enough as a first step.
> 
>   [1/5] cgroup: introduce cancel_attach()
>   [2/5] memcg: add interface to recharge at task move
>   [3/5] memcg: recharge charges of anonymous page
>   [4/5] memcg: avoid oom during recharge at task move
>   [5/5] memcg: recharge charges of anonymous swap
>

Thanks for the posting, I'll test and review the patches (hopefully
tomorrow).
 
-- 
	Balbir


* Re: [PATCH -mmotm 1/5] cgroup: introduce cancel_attach()
  2009-11-19  4:28 ` [PATCH -mmotm 1/5] cgroup: introduce cancel_attach() Daisuke Nishimura
@ 2009-11-19 21:42   ` Paul Menage
  2009-11-19 23:49     ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Paul Menage @ 2009-11-19 21:42 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan

On Wed, Nov 18, 2009 at 8:28 PM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> This patch adds cancel_attach() operation to struct cgroup_subsys.
> cancel_attach() can be used when can_attach() operation prepares something
> for the subsys, but we should rollback what can_attach() operation has prepared
> if attach task fails after we've succeeded in can_attach().
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Acked-by: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: Paul Menage <menage@google.com>

> +                               /*
> +                                * Remember at which subsystem we've failed in
> +                                * can_attach() to call cancel_attach() only
> +                                * against subsystems whose attach() have
> +                                * succeeded(see below).

Maybe: Remember on which subsystem the can_attach() failed, so that we
only call cancel_attach() against the subsystems whose can_attach()
succeeded. (See below)

> +                               /*
> +                                * This means can_attach() of this subsystem
> +                                * have failed, so we don't need to call
> +                                * cancel_attach() against rests of subsystems.
> +                                */

Maybe: This subsystem was the one that failed the can_attach() check
earlier, so we don't need to call cancel_attach() against it or any
remaining subsystems.

Paul


* Re: [PATCH -mmotm 1/5] cgroup: introduce cancel_attach()
  2009-11-19 21:42   ` Paul Menage
@ 2009-11-19 23:49     ` Daisuke Nishimura
  0 siblings, 0 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-19 23:49 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki,
	Li Zefan, Daisuke Nishimura

On Thu, 19 Nov 2009 13:42:19 -0800, Paul Menage <menage@google.com> wrote:
> On Wed, Nov 18, 2009 at 8:28 PM, Daisuke Nishimura
> <nishimura@mxp.nes.nec.co.jp> wrote:
> > This patch adds cancel_attach() operation to struct cgroup_subsys.
> > cancel_attach() can be used when can_attach() operation prepares something
> > for the subsys, but we should rollback what can_attach() operation has prepared
> > if attach task fails after we've succeeded in can_attach().
> >
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Acked-by: Li Zefan <lizf@cn.fujitsu.com>
> 
> Reviewed-by: Paul Menage <menage@google.com>
> 
Thanks.

> > +				/*
> > +				 * Remember at which subsystem we've failed in
> > +				 * can_attach() to call cancel_attach() only
> > +				 * against subsystems whose attach() have
> > +				 * succeeded(see below).
> 
> Maybe: Remember on which subsystem the can_attach() failed, so that we
> only call cancel_attach() against the subsystems whose can_attach()
> succeeded. (See below)
> 
> > +				/*
> > +				 * This means can_attach() of this subsystem
> > +				 * have failed, so we don't need to call
> > +				 * cancel_attach() against rests of subsystems.
> > +				 */
> 
> Maybe: This subsystem was the one that failed the can_attach() check
> earlier, so we don't need to call cancel_attach() against it or any
> remaining subsystems.
> 
Thank you for your fixes.
They are clearer and more accurate comments. I'll merge them.


Regards,
Daisuke Nishimura.


* Re: [PATCH -mmotm 2/5] memcg: add interface to recharge at task move
  2009-11-19  4:29 ` [PATCH -mmotm 2/5] memcg: add interface to recharge at task move Daisuke Nishimura
@ 2009-11-20 15:42   ` Balbir Singh
  2009-11-23 23:56     ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2009-11-20 15:42 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:29:07]:

> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task move. Some users feel this behavior to be strange.
> These patches are for this feature, that is, for recharging to the new cgroup
> and, of course, uncharging from old cgroup at task move.
> 
> This patch adds "memory.recharge_at_immigrate" file, which is a flag file to
> determine whether charges should be moved to the new cgroup at task move or
> not and what type of charges should be recharged. This patch also adds read
> and write handlers of the file.
> 
> This patch also adds no-op handlers for this feature. These handlers will be
> implemented in later patches. And you cannot write any values other than 0
> to recharge_at_immigrate yet.

A basic question that we can clarify in the document: the charge will move
only when mm->owner moves, right?

> 
> Changelog: 2009/11/19
> - consolidate changes in Documentation/cgroup/memory.txt, which were made in
>   other patches separately.
> - handle recharge_at_immigrate as bitmask(as I did in first version).
> - use mm->owner instead of thread_group_leader().
> Changelog: 2009/09/24
> - change the term "migration" to "recharge".
> - handle the flag as bool not bitmask to make codes simple.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  Documentation/cgroups/memory.txt |   42 ++++++++++++++++-
>  mm/memcontrol.c                  |   93 ++++++++++++++++++++++++++++++++++++--
>  2 files changed, 129 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..809585e 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
>  4.2 Task migration
> 
>  When a task migrates from one cgroup to another, it's charge is not
> -carried forward. The pages allocated from the original cgroup still
> +carried forward by default. The pages allocated from the original cgroup still
>  remain charged to it, the charge is dropped when the page is freed or
>  reclaimed.
> 
> +Note: You can move charges of a task along with task migration. See 8.
> +
>  4.3 Removing a cgroup
> 
>  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> @@ -414,7 +416,43 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
>  NOTE2: It is recommended to set the soft limit always below the hard limit,
>         otherwise the hard limit will take precedence.
> 
> -8. TODO
> +8. Recharge at task move
> +
> +Users can move charges associated with a task along with task move, that is,
> +uncharge from the old cgroup and charge to the new cgroup.
> +
> +8.1 Interface
> +
> +This feature is disabled by default. It can be enabled(and disabled again) by
> +writing to memory.recharge_at_immigrate of the destination cgroup.
> +
> +If you want to enable it:
> +
> +# echo (some positive value) > memory.recharge_at_immigrate
> +
> +Note: Each bits of recharge_at_immigrate has its own meaning about what type of
> +charges should be recharged. See 8.2 for details.
> +
> +And if you want disable it again:
> +
> +# echo 0 > memory.recharge_at_immigrate
> +
> +8.2 Type of charges which can be recharged
> +
> +Each bits of recharge_at_immigrate has its own meaning about what type of
> +charges should be recharged.
> +
> +  bit | what type of charges would be recharged ?
> + -----+------------------------------------------------------------------------
> +   0  | A charge of an anonymous page(or swap of it) used by the target task.
> +      | Those pages and swaps must be used only by the target task. You must
> +      | enable Swap Extension(see 2.4) to enable recharge of swap.
> +
> +Note: Those pages and swaps must be charged to the old cgroup.
> +Note: More type of pages(e.g. file cache, shmem,) will be supported by other
> +bits in future.
> +
> +9. TODO
> 
>  1. Add support for accounting huge pages (as a separate controller)
>  2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fc16f08..13fe93d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -226,11 +226,23 @@ struct mem_cgroup {
>  	bool		memsw_is_minimum;
> 
>  	/*
> +	 * Should we recharge charges of a task when a task is moved into this
> +	 * mem_cgroup ? And what type of charges should we recharge ?
> +	 */
> +	unsigned long 	recharge_at_immigrate;

recharge sounds confusing; should we use migrate_charge or
move_charge?

> +
> +	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
>  	struct mem_cgroup_stat stat;
>  };
> 
> +/* Stuffs for recharge at task move. */
> +/* Types of charges to be recharged */
> +enum recharge_type {
> +	NR_RECHARGE_TYPE,
> +};


Can you document that these are left-shifted and hence should
be treated as powers of 2, i.e. bits in a mask?

> +
>  /*
>   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
> @@ -2860,6 +2872,31 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
>  	return 0;
>  }
> 
> +static u64 mem_cgroup_recharge_read(struct cgroup *cgrp,
> +					struct cftype *cft)
> +{
> +	return mem_cgroup_from_cont(cgrp)->recharge_at_immigrate;
> +}
> +
> +static int mem_cgroup_recharge_write(struct cgroup *cgrp,
> +					struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> +	if (val >= (1 << NR_RECHARGE_TYPE))
> +		return -EINVAL;
> +	/*
> +	 * We check this value several times in both in can_attach() and
> +	 * attach(), so we need cgroup lock to prevent this value from being
> +	 * inconsistent.
> +	 */
> +	cgroup_lock();
> +	mem->recharge_at_immigrate = val;
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
> 
>  /* For read statistics */
>  enum {
> @@ -3093,6 +3130,11 @@ static struct cftype mem_cgroup_files[] = {
>  		.read_u64 = mem_cgroup_swappiness_read,
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
> +	{
> +		.name = "recharge_at_immigrate",
> +		.read_u64 = mem_cgroup_recharge_read,
> +		.write_u64 = mem_cgroup_recharge_write,
> +	},
>  };
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -3340,6 +3382,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	if (parent)
>  		mem->swappiness = get_swappiness(parent);
>  	atomic_set(&mem->refcnt, 1);
> +	mem->recharge_at_immigrate = 0;

Should we not inherit this from the parent in a hierarchy?

>  	return &mem->css;
>  free_out:
>  	__mem_cgroup_free(mem);
> @@ -3376,16 +3419,56 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  	return ret;
>  }
> 
> +/* Handlers for recharge at task move. */
> +static int mem_cgroup_can_recharge(void)
> +{
> +	return 0;
> +}
> +
> +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +	int ret = 0;
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> +
> +	if (mem->recharge_at_immigrate) {
> +		struct mm_struct *mm;
> +		struct mem_cgroup *from = mem_cgroup_from_task(p);
> +
> +		VM_BUG_ON(from == mem);
> +
> +		mm = get_task_mm(p);
> +		if (!mm)
> +			return 0;
> +
> +		if (mm->owner == p)
> +			ret = mem_cgroup_can_recharge();
> +
> +		mmput(mm);
> +	}
> +	return ret;
> +}
> +
> +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +}
> +
> +static void mem_cgroup_recharge(void)
> +{
> +}
> +
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  				struct cgroup *cont,
>  				struct cgroup *old_cont,
>  				struct task_struct *p,
>  				bool threadgroup)
>  {
> -	/*
> -	 * FIXME: It's better to move charges of this process from old
> -	 * memcg to new memcg. But it's just on TODO-List now.
> -	 */
> +	mem_cgroup_recharge();
>  }
> 
>  struct cgroup_subsys mem_cgroup_subsys = {
> @@ -3395,6 +3478,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
>  	.pre_destroy = mem_cgroup_pre_destroy,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
> +	.can_attach = mem_cgroup_can_attach,
> +	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
>  	.use_id = 1,
> -- 
> 1.5.6.1
> 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-11-19  4:30 ` [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move Daisuke Nishimura
@ 2009-11-23  5:10   ` Balbir Singh
  2009-11-24  2:43     ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2009-11-23  5:10 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:30:30]:

> This recharge-at-task-move feature has extra charges(pre-charges) on "to"
> mem_cgroup during recharging. This means unnecessary oom can happen.
> 
> This patch tries to avoid such oom.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
>  1 files changed, 28 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index df363da..3a07383 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -249,6 +249,7 @@ struct recharge_struct {
>  	struct mem_cgroup *from;
>  	struct mem_cgroup *to;
>  	unsigned long precharge;
> +	struct task_struct *working;	/* a task moving the target task */

working does not sound like an appropriate name

>  };
>  static struct recharge_struct recharge;
> 
> @@ -1494,6 +1495,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		if (mem_cgroup_check_under_limit(mem_over_limit))
>  			continue;
> 
> +		/* try to avoid oom while someone is recharging */
> +		if (recharge.working && current != recharge.working) {
> +			struct mem_cgroup *dest;
> +			bool do_continue = false;
> +			/*
> +			 * There is a small race that "dest" can be freed by
> +			 * rmdir, so we use css_tryget().
> +			 */
> +			rcu_read_lock();
> +			dest = recharge.to;
> +			if (dest && css_tryget(&dest->css)) {
> +				if (dest->use_hierarchy)
> +					do_continue = css_is_ancestor(
> +							&dest->css,
> +							&mem_over_limit->css);
> +				else
> +					do_continue = (dest == mem_over_limit);
> +				css_put(&dest->css);
> +			}
> +			rcu_read_unlock();
> +			if (do_continue)
> +				continue;

IIUC, if dest is the current cgroup we are trying to charge to or an
ancestor of the current cgroup, we don't OOM?

> +		}
> +
>  		if (!nr_retries--) {
>  			if (oom) {
>  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> @@ -3474,6 +3499,7 @@ static void mem_cgroup_clear_recharge(void)
>  	}
>  	recharge.from = NULL;
>  	recharge.to = NULL;
> +	recharge.working = NULL;
>  }
> 
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> @@ -3498,9 +3524,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>  			VM_BUG_ON(recharge.from);
>  			VM_BUG_ON(recharge.to);
>  			VM_BUG_ON(recharge.precharge);
> +			VM_BUG_ON(recharge.working);
>  			recharge.from = from;
>  			recharge.to = mem;
>  			recharge.precharge = 0;
> +			recharge.working = current;
> 
>  			ret = mem_cgroup_prepare_recharge(mm);
>  			if (ret)

Sorry, if I missed it, but I did not see any time overhead of moving a
task after these changes. Could you please help me understand the cost
of moving say a task with 1G anonymous memory to another group and
the cost of moving a task with 512MB anonymous and 512 page cache
mapped, etc. It would be nice to understand the overall cost.

-- 
	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap
  2009-11-19  4:31 ` [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap Daisuke Nishimura
@ 2009-11-23  6:59   ` Balbir Singh
  2009-11-24  7:54     ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2009-11-23  6:59 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:31:20]:

> This patch is another core part of this recharge-at-task-move feature.
> It enables recharge of anonymous swaps.
> 
> To move the charge of swap, we need to exchange swap_cgroup's record.
> 
> In current implementation, swap_cgroup's record is protected by:
> 
>   - page lock: if the entry is on swap cache.
>   - swap_lock: if the entry is not on swap cache.
> 
> This works well in usual swap-in/out activity.
> 
> But this behavior make charge migration of swap check many conditions to
> exchange swap_cgroup's record safely.
> 
> So I changed modification of swap_cgroup's recored(swap_cgroup_record())
> to use xchg, and define a new function to cmpxchg swap_cgroup's record.
> 
> This patch also enables recharge of non pte_present but not uncharged swap
> caches, which can be exist on swap-out path,  by getting the target pages via
> find_get_page() as do_mincore() does.
> 
> Changelog: 2009/11/19
> - in can_attach(), instead of parsing the page table, make use of per process
>   mm_counter(swap_usage).
> Changelog: 2009/11/06
> - drop support for shmem's swap(revisit in future).
> - add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
>   multiple processes(revisit in future).
> Changelog: 2009/09/24
> - do no swap-in in moving swap account any more.
> - add support for shmem's swap.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/page_cgroup.h |    2 +
>  include/linux/swap.h        |    1 +
>  mm/memcontrol.c             |  150 ++++++++++++++++++++++++++++++++++---------
>  mm/page_cgroup.c            |   35 ++++++++++-
>  mm/swapfile.c               |   31 +++++++++
>  5 files changed, 186 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b0e4eb1..30b0813 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
>  #include <linux/swap.h>
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> +					unsigned short old, unsigned short new);
>  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
>  extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
>  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 9f0ca32..2a3209e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
> +extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3a07383..ea00a93 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -33,6 +33,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/spinlock.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> @@ -2230,6 +2231,49 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
>  	}
>  	rcu_read_unlock();
>  }
> +
> +/**
> + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> + * @entry: swap entry to be moved
> + * @from:  mem_cgroup which the entry is moved from
> + * @to:  mem_cgroup which the entry is moved to
> + *
> + * It successes only when the swap_cgroup's record for this entry is the same
         ^^^^^^^^^
Should be succeeds

> + * as the mem_cgroup's id of @from.
> + *
> + * Returns 0 on success, -EINVAL on failure.
> + *
> + * The caller must have called __mem_cgroup_try_charge on @to.
> + */
> +static int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	unsigned short old_id, new_id;
> +
> +	old_id = css_id(&from->css);
> +	new_id = css_id(&to->css);
> +
> +	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> +		if (!mem_cgroup_is_root(from))
> +			res_counter_uncharge(&from->memsw, PAGE_SIZE);
> +		mem_cgroup_swap_statistics(from, false);
> +		mem_cgroup_put(from);
> +
> +		if (!mem_cgroup_is_root(to))
> +			res_counter_uncharge(&to->res, PAGE_SIZE);

Uncharge from "to" as well? Need more comments to understand the code.

> +		mem_cgroup_swap_statistics(to, true);
> +		mem_cgroup_get(to);
> +
> +		return 0;
> +	}
> +	return -EINVAL;
> +}
> +#else
> +static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	return -EINVAL;
> +}
>  #endif
> 
>  /*
> @@ -3477,8 +3521,10 @@ static int mem_cgroup_prepare_recharge(struct mm_struct *mm)
>  	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
>  					&recharge.to->recharge_at_immigrate);
> 
> -	if (recharge_anon)
> +	if (recharge_anon) {
>  		prepare += get_mm_counter(mm, anon_rss);
> +		prepare += get_mm_counter(mm, swap_usage);
> +	}

Does this logic handle shared pages correctly? Could you please check.

> 
>  	while (!ret && prepare--) {
>  		if (!count--) {
> @@ -3553,66 +3599,99 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>   * @vma: the vma the pte to be checked belongs
>   * @addr: the address corresponding to the pte to be checked
>   * @ptent: the pte to be checked
> - * @target: the pointer the target page will be stored
> + * @target: the pointer the target page or swap entry will be stored
>   *
>   * Returns
>   *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
>   *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
>   *     for recharge. if @target is not NULL, the page is stored in target->page
>   *     with extra refcnt got(Callers should handle it).
> + *   2(MIGRATION_TARGET_SWAP): if the swap entry corresponding to this pte is a
> + *     target for charge migration. if @target is not NULL, the entry is stored
> + *     in target->ent.
>   *
>   * Called with pte lock held.
>   */
> -/* We add a new member later. */
>  union recharge_target {
>  	struct page	*page;
> +	swp_entry_t	ent;
>  };
> 
> -/* We add a new type later. */
>  enum recharge_target_type {
>  	RECHARGE_TARGET_NONE,	/* not used */
>  	RECHARGE_TARGET_PAGE,
> +	RECHARGE_TARGET_SWAP,
>  };
> 
>  static int is_target_pte_for_recharge(struct vm_area_struct *vma,
>  		unsigned long addr, pte_t ptent, union recharge_target *target)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	struct page_cgroup *pc;
> +	swp_entry_t ent = { .val = 0 };
> +	int user = 0;
>  	int ret = 0;
>  	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
>  					&recharge.to->recharge_at_immigrate);
> 
> -	if (!pte_present(ptent))
> -		return 0;
> +	if (!pte_present(ptent)) {
> +		/* TODO: handle swap of shmes/tmpfs */
> +		if (pte_none(ptent) || pte_file(ptent))
> +			return 0;
> +		else if (is_swap_pte(ptent)) {
> +			ent = pte_to_swp_entry(ptent);
> +			if (!recharge_anon || non_swap_entry(ent))
> +				return 0;
> +			user = mem_cgroup_count_swap_user(ent, &page);
> +		}
> +	} else {
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page || !page_mapped(page))
> +			return 0;
> +		/*
> +		 * TODO: We don't recharge file(including shmem/tmpfs) pages
> +		 * for now.
> +		 */
> +		if (!recharge_anon || !PageAnon(page))
> +			return 0;
> +		if (!get_page_unless_zero(page))
> +			return 0;
> +		user = page_mapcount(page);
> +	}
> 
> -	page = vm_normal_page(vma, addr, ptent);
> -	if (!page || !page_mapped(page))
> -		return 0;
> -	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
> -	if (!recharge_anon || !PageAnon(page))
> -		return 0;
> -	/*
> -	 * TODO: We don't recharge shared(used by multiple processes) pages
> -	 * for now.
> -	 */
> -	if (page_mapcount(page) > 1)
> -		return 0;
> -	if (!get_page_unless_zero(page))
> +	if (user > 1) {

users or usage_count would be better name.

> +		/*
> +		 * TODO: We don't recharge shared(used by multiple processes)
> +		 * pages for now.
> +		 */
> +		if (page)
> +			put_page(page);
>  		return 0;
> -
> -	pc = lookup_page_cgroup(page);
> -	/*
> -	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
> -	 * checks the pc is valid or not under the lock.
> -	 */
> -	if (PageCgroupUsed(pc)) {
> -		ret = RECHARGE_TARGET_PAGE;
> -		target->page = page;
>  	}
> 
> -	if (!ret)
> -		put_page(page);
> +	if (page) {
> +		pc = lookup_page_cgroup(page);
> +		/*
> +		 * Do only loose check w/o page_cgroup lock.
> +		 * mem_cgroup_move_account() checks the pc is valid or not under
> +		 * the lock.
> +		 */
> +		if (PageCgroupUsed(pc)) {
> +			ret = RECHARGE_TARGET_PAGE;
> +			target->page = page;
> +		}
> +		if (!ret)
> +			put_page(page);
> +	}
> +	/* fall throught */

should be through

> +	if (ent.val && do_swap_account && !ret) {
> +		/*
> +		 * mem_cgroup_move_swap_account() checks the entry is valid or
> +		 * not.
> +		 */
> +		ret = RECHARGE_TARGET_SWAP;
> +		target->ent = ent;
> +	}
> 
>  	return ret;
>  }
> @@ -3634,6 +3713,7 @@ retry:
>  		int type;
>  		struct page *page;
>  		struct page_cgroup *pc;
> +		swp_entry_t ent;
> 
>  		if (!recharge.precharge)
>  			break;
> @@ -3654,6 +3734,14 @@ retry:
>  put:			/* is_target_pte_for_recharge() gets the page */
>  			put_page(page);
>  			break;
> +		case RECHARGE_TARGET_SWAP:
> +			ent = target.ent;
> +			if (!mem_cgroup_move_swap_account(ent,
> +						recharge.from, recharge.to)) {
> +				css_put(&recharge.to->css);
> +				recharge.precharge--;
> +			}
> +			break;
>  		default:
>  			break;
>  		}
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 3d535d5..213b0ee 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -9,6 +9,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/cgroup.h>
>  #include <linux/swapops.h>
> +#include <asm/cmpxchg.h>
> 
>  static void __meminit
>  __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> @@ -335,6 +336,37 @@ not_enough_page:
>  }
> 
>  /**
> + * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
> + * @end: swap entry to be cmpxchged
> + * @old: old id
> + * @new: new id
> + *
> + * Returns old id at success, 0 at failure.
> + * (There is no mem_cgroup useing 0 as its id)
> + */
> +unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> +					unsigned short old, unsigned short new)
> +{
> +	int type = swp_type(ent);
> +	unsigned long offset = swp_offset(ent);
> +	unsigned long idx = offset / SC_PER_PAGE;
> +	unsigned long pos = offset & SC_POS_MASK;
> +	struct swap_cgroup_ctrl *ctrl;
> +	struct page *mappage;
> +	struct swap_cgroup *sc;
> +
> +	ctrl = &swap_cgroup_ctrl[type];
> +
> +	mappage = ctrl->map[idx];
> +	sc = page_address(mappage);
> +	sc += pos;
> +	if (cmpxchg(&sc->id, old, new) == old)
> +		return old;
> +	else
> +		return 0;
> +}
> +
> +/**
>   * swap_cgroup_record - record mem_cgroup for this swp_entry.
>   * @ent: swap entry to be recorded into
>   * @mem: mem_cgroup to be recorded
> @@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
>  	mappage = ctrl->map[idx];
>  	sc = page_address(mappage);
>  	sc += pos;
> -	old = sc->id;
> -	sc->id = id;
> +	old = xchg(&sc->id, id);
> 
>  	return old;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f32d716..cc2e859 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -719,6 +719,37 @@ int free_swap_and_cache(swp_entry_t entry)
>  	return p != NULL;
>  }
> 
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/**
> + * mem_cgroup_count_swap_user - count the user of a swap entry
> + * @ent: the swap entry to be checked
> + * @pagep: the pointer for the swap cache page of the entry to be stored
> + *
> + * Returns the number of the user of the swap entry. The number is valid only
> + * for swaps of anonymous pages.
> + * If the entry is found on swap cache, the page is stored to pagep with
> + * refcount of it being incremented.
> + */
> +int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep)
> +{
> +	struct page *page;
> +	struct swap_info_struct *p;
> +	int count = 0;
> +
> +	page = find_get_page(&swapper_space, ent.val);
> +	if (page)
> +		count += page_mapcount(page);
> +	p = swap_info_get(ent);
> +	if (p) {
> +		count += swap_count(p->swap_map[swp_offset(ent)]);
> +		spin_unlock(&swap_lock);
> +	}
> +
> +	*pagep = page;
> +	return count;
> +}
> +#endif
> +
>  #ifdef CONFIG_HIBERNATION
>  /*
>   * Find the swap type that corresponds to given device (if any).
> -- 
> 1.5.6.1
> 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 2/5] memcg: add interface to recharge at task move
  2009-11-20 15:42   ` Balbir Singh
@ 2009-11-23 23:56     ` Daisuke Nishimura
  0 siblings, 0 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-23 23:56 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

Thank you for your review and comment.

On Fri, 20 Nov 2009 21:12:45 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:29:07]:
> 
> > In current memcg, charges associated with a task aren't moved to the new cgroup
> > at task move. Some users feel this behavior to be strange.
> > These patches are for this feature, that is, for recharging to the new cgroup
> > and, of course, uncharging from old cgroup at task move.
> > 
> > This patch adds "memory.recharge_at_immigrate" file, which is a flag file to
> > determine whether charges should be moved to the new cgroup at task move or
> > not and what type of charges should be recharged. This patch also adds read
> > and write handlers of the file.
> > 
> > This patch also adds no-op handlers for this feature. These handlers will be
> > implemented in later patches. And you cannot write any values other than 0
> > to recharge_at_immigrate yet.
> 
> A basic question that we can clarify in the document, charge will move
> only when mm->owner moves right?
> 
yes.
I'll add comments in the patch description and memory.txt.

> > 
> > Changelog: 2009/11/19
> > - consolidate changes in Documentation/cgroup/memory.txt, which were made in
> >   other patches separately.
> > - handle recharge_at_immigrate as bitmask(as I did in first version).
> > - use mm->owner instead of thread_group_leader().
> > Changelog: 2009/09/24
> > - change the term "migration" to "recharge".
> > - handle the flag as bool not bitmask to make codes simple.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  Documentation/cgroups/memory.txt |   42 ++++++++++++++++-
> >  mm/memcontrol.c                  |   93 ++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 129 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index b871f25..809585e 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
> >  4.2 Task migration
> > 
> >  When a task migrates from one cgroup to another, it's charge is not
> > -carried forward. The pages allocated from the original cgroup still
> > +carried forward by default. The pages allocated from the original cgroup still
> >  remain charged to it, the charge is dropped when the page is freed or
> >  reclaimed.
> > 
> > +Note: You can move charges of a task along with task migration. See 8.
> > +
> >  4.3 Removing a cgroup
> > 
> >  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> > @@ -414,7 +416,43 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
> >  NOTE2: It is recommended to set the soft limit always below the hard limit,
> >         otherwise the hard limit will take precedence.
> > 
> > -8. TODO
> > +8. Recharge at task move
> > +
> > +Users can move charges associated with a task along with task move, that is,
> > +uncharge from the old cgroup and charge to the new cgroup.
> > +
> > +8.1 Interface
> > +
> > +This feature is disabled by default. It can be enabled(and disabled again) by
> > +writing to memory.recharge_at_immigrate of the destination cgroup.
> > +
> > +If you want to enable it:
> > +
> > +# echo (some positive value) > memory.recharge_at_immigrate
> > +
> > +Note: Each bits of recharge_at_immigrate has its own meaning about what type of
> > +charges should be recharged. See 8.2 for details.
> > +
> > +And if you want disable it again:
> > +
> > +# echo 0 > memory.recharge_at_immigrate
> > +
> > +8.2 Type of charges which can be recharged
> > +
> > +Each bits of recharge_at_immigrate has its own meaning about what type of
> > +charges should be recharged.
> > +
> > +  bit | what type of charges would be recharged ?
> > + -----+------------------------------------------------------------------------
> > +   0  | A charge of an anonymous page(or swap of it) used by the target task.
> > +      | Those pages and swaps must be used only by the target task. You must
> > +      | enable Swap Extension(see 2.4) to enable recharge of swap.
> > +
> > +Note: Those pages and swaps must be charged to the old cgroup.
> > +Note: More type of pages(e.g. file cache, shmem,) will be supported by other
> > +bits in future.
> > +
> > +9. TODO
> > 
> >  1. Add support for accounting huge pages (as a separate controller)
> >  2. Make per-cgroup scanner reclaim not-shared pages first
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index fc16f08..13fe93d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -226,11 +226,23 @@ struct mem_cgroup {
> >  	bool		memsw_is_minimum;
> > 
> >  	/*
> > +	 * Should we recharge charges of a task when a task is moved into this
> > +	 * mem_cgroup ? And what type of charges should we recharge ?
> > +	 */
> > +	unsigned long 	recharge_at_immigrate;
> 
> recharge sounds confusing; should we use migrate_charge or
> move_charge instead?
> 
O.K.
The term "migrate" can be confused with "page migration",
so I'll use "move_charge"(including other function names).

> > +
> > +	/*
> >  	 * statistics. This must be placed at the end of memcg.
> >  	 */
> >  	struct mem_cgroup_stat stat;
> >  };
> > 
> > +/* Stuffs for recharge at task move. */
> > +/* Types of charges to be recharged */
> > +enum recharge_type {
> > +	NR_RECHARGE_TYPE,
> > +};
> 
> 
> Can you document that these are left-shifted and hence should
> be treated as powers of 2, i.e. bits in a bitmask?
> 
will do.

> > +
> >  /*
> >   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> >   * limit reclaim to prevent infinite loops, if they ever occur.
> > @@ -2860,6 +2872,31 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> >  	return 0;
> >  }
> > 
> > +static u64 mem_cgroup_recharge_read(struct cgroup *cgrp,
> > +					struct cftype *cft)
> > +{
> > +	return mem_cgroup_from_cont(cgrp)->recharge_at_immigrate;
> > +}
> > +
> > +static int mem_cgroup_recharge_write(struct cgroup *cgrp,
> > +					struct cftype *cft, u64 val)
> > +{
> > +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> > +
> > +	if (val >= (1 << NR_RECHARGE_TYPE))
> > +		return -EINVAL;
> > +	/*
> > +	 * We check this value several times in both in can_attach() and
> > +	 * attach(), so we need cgroup lock to prevent this value from being
> > +	 * inconsistent.
> > +	 */
> > +	cgroup_lock();
> > +	mem->recharge_at_immigrate = val;
> > +	cgroup_unlock();
> > +
> > +	return 0;
> > +}
> > +
> > 
> >  /* For read statistics */
> >  enum {
> > @@ -3093,6 +3130,11 @@ static struct cftype mem_cgroup_files[] = {
> >  		.read_u64 = mem_cgroup_swappiness_read,
> >  		.write_u64 = mem_cgroup_swappiness_write,
> >  	},
> > +	{
> > +		.name = "recharge_at_immigrate",
> > +		.read_u64 = mem_cgroup_recharge_read,
> > +		.write_u64 = mem_cgroup_recharge_write,
> > +	},
> >  };
> > 
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > @@ -3340,6 +3382,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	if (parent)
> >  		mem->swappiness = get_swappiness(parent);
> >  	atomic_set(&mem->refcnt, 1);
> > +	mem->recharge_at_immigrate = 0;
> 
> Should we not inherit this from the parent in a hierarchy?
> 
hmm, good question.
IMHO it's unnecessary, because this patch moves charges which are charged against
the source cgroup itself, not the hierarchy including it.


Regards,
Daisuke Nishimura.

> >  	return &mem->css;
> >  free_out:
> >  	__mem_cgroup_free(mem);
> > @@ -3376,16 +3419,56 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  	return ret;
> >  }
> > 
> > +/* Handlers for recharge at task move. */
> > +static int mem_cgroup_can_recharge(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +	int ret = 0;
> > +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> > +
> > +	if (mem->recharge_at_immigrate) {
> > +		struct mm_struct *mm;
> > +		struct mem_cgroup *from = mem_cgroup_from_task(p);
> > +
> > +		VM_BUG_ON(from == mem);
> > +
> > +		mm = get_task_mm(p);
> > +		if (!mm)
> > +			return 0;
> > +
> > +		if (mm->owner == p)
> > +			ret = mem_cgroup_can_recharge();
> > +
> > +		mmput(mm);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +}
> > +
> > +static void mem_cgroup_recharge(void)
> > +{
> > +}
> > +
> >  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> >  				struct cgroup *cont,
> >  				struct cgroup *old_cont,
> >  				struct task_struct *p,
> >  				bool threadgroup)
> >  {
> > -	/*
> > -	 * FIXME: It's better to move charges of this process from old
> > -	 * memcg to new memcg. But it's just on TODO-List now.
> > -	 */
> > +	mem_cgroup_recharge();
> >  }
> > 
> >  struct cgroup_subsys mem_cgroup_subsys = {
> > @@ -3395,6 +3478,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
> >  	.pre_destroy = mem_cgroup_pre_destroy,
> >  	.destroy = mem_cgroup_destroy,
> >  	.populate = mem_cgroup_populate,
> > +	.can_attach = mem_cgroup_can_attach,
> > +	.cancel_attach = mem_cgroup_cancel_attach,
> >  	.attach = mem_cgroup_move_task,
> >  	.early_init = 0,
> >  	.use_id = 1,
> > -- 
> > 1.5.6.1
> > 
> 
> -- 
> 	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-11-23  5:10   ` Balbir Singh
@ 2009-11-24  2:43     ` Daisuke Nishimura
  2009-11-27  4:58       ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-24  2:43 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

On Mon, 23 Nov 2009 10:40:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:30:30]:
> 
> > This recharge-at-task-move feature has extra charges(pre-charges) on "to"
> > mem_cgroup during recharging. This means unnecessary oom can happen.
> > 
> > This patch tries to avoid such oom.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
> >  1 files changed, 28 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index df363da..3a07383 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -249,6 +249,7 @@ struct recharge_struct {
> >  	struct mem_cgroup *from;
> >  	struct mem_cgroup *to;
> >  	unsigned long precharge;
> > +	struct task_struct *working;	/* a task moving the target task */
> 
> working does not sound like an appropriate name
> 
Then, how about "moving" ?

> >  };
> >  static struct recharge_struct recharge;
> > 
> > @@ -1494,6 +1495,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  		if (mem_cgroup_check_under_limit(mem_over_limit))
> >  			continue;
> > 
> > +		/* try to avoid oom while someone is recharging */
> > +		if (recharge.working && current != recharge.working) {
> > +			struct mem_cgroup *dest;
> > +			bool do_continue = false;
> > +			/*
> > +			 * There is a small race that "dest" can be freed by
> > +			 * rmdir, so we use css_tryget().
> > +			 */
> > +			rcu_read_lock();
> > +			dest = recharge.to;
> > +			if (dest && css_tryget(&dest->css)) {
> > +				if (dest->use_hierarchy)
> > +					do_continue = css_is_ancestor(
> > +							&dest->css,
> > +							&mem_over_limit->css);
> > +				else
> > +					do_continue = (dest == mem_over_limit);
> > +				css_put(&dest->css);
> > +			}
> > +			rcu_read_unlock();
> > +			if (do_continue)
> > +				continue;
> 
> IIUC, if dest is the current cgroup we are trying to charge to or an
> ancestor of the current cgroup, we don't OOM?
> 
We don't OOM:
- if dest is the cgroup we are trying to charge to (in the w/o-hierarchy case).
- if the cgroup we are trying to charge to is an ancestor of dest (in the hierarchy case).
because this feature keeps some amount of extra charge (the precharge) on the dest cgroup,
so we had better avoid oom against the dest cgroup while charges are being moved.

BTW, the above code has a bug. We should check mem_over_limit->use_hierarchy,
not dest->use_hierarchy. Checking dest->use_hierarchy returns true even when:

	/A : use_hierarchy == 0		<- mem_over_limit
		00/: use_hierarchy == 1	<- dest

(IIUC, css_is_ancestor() only checks hierarchy in cgroup layer.)

And the current task_in_mem_cgroup() has the same bug, which can lead to killing an
innocent task (I'll check, test, and send a patch later).
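
For illustration, the fixed check would look something like this (just a sketch
of the fix described above, not a posted patch; the helper name is made up):

/*
 * Should a concurrent charger keep retrying instead of declaring OOM?
 * It should if the cgroup that hit its limit accounts "dest", because dest
 * temporarily holds the precharge for the ongoing move and will release the
 * unused part when the move finishes.
 * Note: check mem_over_limit->use_hierarchy, not dest->use_hierarchy.
 */
static bool should_wait_for_move(struct mem_cgroup *dest,
				 struct mem_cgroup *mem_over_limit)
{
	if (mem_over_limit->use_hierarchy)
		return css_is_ancestor(&dest->css, &mem_over_limit->css);
	return dest == mem_over_limit;
}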

> > +		}
> > +
> >  		if (!nr_retries--) {
> >  			if (oom) {
> >  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> > @@ -3474,6 +3499,7 @@ static void mem_cgroup_clear_recharge(void)
> >  	}
> >  	recharge.from = NULL;
> >  	recharge.to = NULL;
> > +	recharge.working = NULL;
> >  }
> > 
> >  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > @@ -3498,9 +3524,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> >  			VM_BUG_ON(recharge.from);
> >  			VM_BUG_ON(recharge.to);
> >  			VM_BUG_ON(recharge.precharge);
> > +			VM_BUG_ON(recharge.working);
> >  			recharge.from = from;
> >  			recharge.to = mem;
> >  			recharge.precharge = 0;
> > +			recharge.working = current;
> > 
> >  			ret = mem_cgroup_prepare_recharge(mm);
> >  			if (ret)
> 
> Sorry, if I missed it, but I did not see any time overhead of moving a
> task after these changes. Could you please help me understand the cost
> of moving say a task with 1G anonymous memory to another group and
> the cost of moving a task with 512MB anonymous and 512 page cache
> mapped, etc. It would be nice to understand the overall cost.
> 
O.K.
I'll test programs with big anonymous pages and measure the time and report.


Regards,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap
  2009-11-23  6:59   ` Balbir Singh
@ 2009-11-24  7:54     ` Daisuke Nishimura
  0 siblings, 0 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-24  7:54 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

On Mon, 23 Nov 2009 12:29:52 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-19 13:31:20]:
> 
> > This patch is another core part of this recharge-at-task-move feature.
> > It enables recharge of anonymous swaps.
> > 
> > To move the charge of swap, we need to exchange swap_cgroup's record.
> > 
> > In current implementation, swap_cgroup's record is protected by:
> > 
> >   - page lock: if the entry is on swap cache.
> >   - swap_lock: if the entry is not on swap cache.
> > 
> > This works well in usual swap-in/out activity.
> > 
> > But this behavior make charge migration of swap check many conditions to
> > exchange swap_cgroup's record safely.
> > 
> > So I changed modification of swap_cgroup's recored(swap_cgroup_record())
> > to use xchg, and define a new function to cmpxchg swap_cgroup's record.
> > 
> > This patch also enables recharge of non pte_present but not uncharged swap
> > caches, which can be exist on swap-out path,  by getting the target pages via
> > find_get_page() as do_mincore() does.
> > 
> > Changelog: 2009/11/19
> > - in can_attach(), instead of parsing the page table, make use of per process
> >   mm_counter(swap_usage).
> > Changelog: 2009/11/06
> > - drop support for shmem's swap(revisit in future).
> > - add mem_cgroup_count_swap_user() to prevent moving charges of swaps used by
> >   multiple processes(revisit in future).
> > Changelog: 2009/09/24
> > - do no swap-in in moving swap account any more.
> > - add support for shmem's swap.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  include/linux/page_cgroup.h |    2 +
> >  include/linux/swap.h        |    1 +
> >  mm/memcontrol.c             |  150 ++++++++++++++++++++++++++++++++++---------
> >  mm/page_cgroup.c            |   35 ++++++++++-
> >  mm/swapfile.c               |   31 +++++++++
> >  5 files changed, 186 insertions(+), 33 deletions(-)
> > 
> > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > index b0e4eb1..30b0813 100644
> > --- a/include/linux/page_cgroup.h
> > +++ b/include/linux/page_cgroup.h
> > @@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
> >  #include <linux/swap.h>
> > 
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> > +					unsigned short old, unsigned short new);
> >  extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
> >  extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
> >  extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 9f0ca32..2a3209e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  extern void
> >  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
> > +extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
> >  #else
> >  static inline void
> >  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 3a07383..ea00a93 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/rbtree.h>
> >  #include <linux/slab.h>
> >  #include <linux/swap.h>
> > +#include <linux/swapops.h>
> >  #include <linux/spinlock.h>
> >  #include <linux/fs.h>
> >  #include <linux/seq_file.h>
> > @@ -2230,6 +2231,49 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
> >  	}
> >  	rcu_read_unlock();
> >  }
> > +
> > +/**
> > + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> > + * @entry: swap entry to be moved
> > + * @from:  mem_cgroup which the entry is moved from
> > + * @to:  mem_cgroup which the entry is moved to
> > + *
> > + * It successes only when the swap_cgroup's record for this entry is the same
>          ^^^^^^^^^
> Should be succeeds
> 
oops, sorry for my poor English...

> > + * as the mem_cgroup's id of @from.
> > + *
> > + * Returns 0 on success, -EINVAL on failure.
> > + *
> > + * The caller must have called __mem_cgroup_try_charge on @to.
> > + */
> > +static int mem_cgroup_move_swap_account(swp_entry_t entry,
> > +				struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	unsigned short old_id, new_id;
> > +
> > +	old_id = css_id(&from->css);
> > +	new_id = css_id(&to->css);
> > +
> > +	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> > +		if (!mem_cgroup_is_root(from))
> > +			res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > +		mem_cgroup_swap_statistics(from, false);
> > +		mem_cgroup_put(from);
> > +
> > +		if (!mem_cgroup_is_root(to))
> > +			res_counter_uncharge(&to->res, PAGE_SIZE);
> 
> Uncharge from "to" as well? Need more comments to understand the code.
> 
We've charged not only to->memsw but also to->res in can_attach() (try_charge does it),
so we should uncharge to->res here.
I'm sorry for confusing you. I'll add a comment.
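
For illustration, the bookkeeping is something like this (just a sketch with a
made-up helper name, not the actual patch):

/*
 * The precharge taken in can_attach() via __mem_cgroup_try_charge() raised
 * both to->res and to->memsw.  When the moved item turns out to be a swap
 * entry, only the memsw part is really consumed, so the res part of the
 * precharge is handed back to "to" here.
 */
static void consume_swap_precharge(struct mem_cgroup *to)
{
	if (!mem_cgroup_is_root(to))
		res_counter_uncharge(&to->res, PAGE_SIZE);
	/* to->memsw keeps the PAGE_SIZE taken at precharge time */
}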

> > +		mem_cgroup_swap_statistics(to, true);
> > +		mem_cgroup_get(to);
> > +
> > +		return 0;
> > +	}
> > +	return -EINVAL;
> > +}
> > +#else
> > +static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
> > +				struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	return -EINVAL;
> > +}
> >  #endif
> > 
> >  /*
> > @@ -3477,8 +3521,10 @@ static int mem_cgroup_prepare_recharge(struct mm_struct *mm)
> >  	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
> >  					&recharge.to->recharge_at_immigrate);
> > 
> > -	if (recharge_anon)
> > +	if (recharge_anon) {
> >  		prepare += get_mm_counter(mm, anon_rss);
> > +		prepare += get_mm_counter(mm, swap_usage);
> > +	}
> 
> Does this logic handle shared pages correctly? Could you please check.
> 
No. We ignore whether the page (or swap) is shared or not at this phase (in can_attach).
It's handled strictly in the attach phase, where we don't move charges if it's shared, and we
call cancel_charge against "to".
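
For illustration, the cancel step looks something like this (just a sketch with
a made-up function name, not the actual patch):

/* At the end of the attach phase: give back whatever precharge was not used
 * because the corresponding pages/swaps turned out to be shared. */
static void cancel_unused_precharge(struct mem_cgroup *to, unsigned long unused)
{
	if (!unused)
		return;
	if (!mem_cgroup_is_root(to)) {
		res_counter_uncharge(&to->res, PAGE_SIZE * unused);
		if (do_swap_account)
			res_counter_uncharge(&to->memsw, PAGE_SIZE * unused);
	}
}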

> > 
> >  	while (!ret && prepare--) {
> >  		if (!count--) {
> > @@ -3553,66 +3599,99 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> >   * @vma: the vma the pte to be checked belongs
> >   * @addr: the address corresponding to the pte to be checked
> >   * @ptent: the pte to be checked
> > - * @target: the pointer the target page will be stored
> > + * @target: the pointer the target page or swap entry will be stored
> >   *
> >   * Returns
> >   *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
> >   *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
> >   *     for recharge. if @target is not NULL, the page is stored in target->page
> >   *     with extra refcnt got(Callers should handle it).
> > + *   2(MIGRATION_TARGET_SWAP): if the swap entry corresponding to this pte is a
> > + *     target for charge migration. if @target is not NULL, the entry is stored
> > + *     in target->ent.
> >   *
> >   * Called with pte lock held.
> >   */
> > -/* We add a new member later. */
> >  union recharge_target {
> >  	struct page	*page;
> > +	swp_entry_t	ent;
> >  };
> > 
> > -/* We add a new type later. */
> >  enum recharge_target_type {
> >  	RECHARGE_TARGET_NONE,	/* not used */
> >  	RECHARGE_TARGET_PAGE,
> > +	RECHARGE_TARGET_SWAP,
> >  };
> > 
> >  static int is_target_pte_for_recharge(struct vm_area_struct *vma,
> >  		unsigned long addr, pte_t ptent, union recharge_target *target)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  	struct page_cgroup *pc;
> > +	swp_entry_t ent = { .val = 0 };
> > +	int user = 0;
> >  	int ret = 0;
> >  	bool recharge_anon = test_bit(RECHARGE_TYPE_ANON,
> >  					&recharge.to->recharge_at_immigrate);
> > 
> > -	if (!pte_present(ptent))
> > -		return 0;
> > +	if (!pte_present(ptent)) {
> > +		/* TODO: handle swap of shmes/tmpfs */
> > +		if (pte_none(ptent) || pte_file(ptent))
> > +			return 0;
> > +		else if (is_swap_pte(ptent)) {
> > +			ent = pte_to_swp_entry(ptent);
> > +			if (!recharge_anon || non_swap_entry(ent))
> > +				return 0;
> > +			user = mem_cgroup_count_swap_user(ent, &page);
> > +		}
> > +	} else {
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page || !page_mapped(page))
> > +			return 0;
> > +		/*
> > +		 * TODO: We don't recharge file(including shmem/tmpfs) pages
> > +		 * for now.
> > +		 */
> > +		if (!recharge_anon || !PageAnon(page))
> > +			return 0;
> > +		if (!get_page_unless_zero(page))
> > +			return 0;
> > +		user = page_mapcount(page);
> > +	}
> > 
> > -	page = vm_normal_page(vma, addr, ptent);
> > -	if (!page || !page_mapped(page))
> > -		return 0;
> > -	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
> > -	if (!recharge_anon || !PageAnon(page))
> > -		return 0;
> > -	/*
> > -	 * TODO: We don't recharge shared(used by multiple processes) pages
> > -	 * for now.
> > -	 */
> > -	if (page_mapcount(page) > 1)
> > -		return 0;
> > -	if (!get_page_unless_zero(page))
> > +	if (user > 1) {
> 
> users or usage_count would be better name.
> 
Thank you for your suggestion.

> > +		/*
> > +		 * TODO: We don't recharge shared(used by multiple processes)
> > +		 * pages for now.
> > +		 */
> > +		if (page)
> > +			put_page(page);
> >  		return 0;
> > -
> > -	pc = lookup_page_cgroup(page);
> > -	/*
> > -	 * Do only loose check w/o page_cgroup lock. mem_cgroup_move_account()
> > -	 * checks the pc is valid or not under the lock.
> > -	 */
> > -	if (PageCgroupUsed(pc)) {
> > -		ret = RECHARGE_TARGET_PAGE;
> > -		target->page = page;
> >  	}
> > 
> > -	if (!ret)
> > -		put_page(page);
> > +	if (page) {
> > +		pc = lookup_page_cgroup(page);
> > +		/*
> > +		 * Do only loose check w/o page_cgroup lock.
> > +		 * mem_cgroup_move_account() checks the pc is valid or not under
> > +		 * the lock.
> > +		 */
> > +		if (PageCgroupUsed(pc)) {
> > +			ret = RECHARGE_TARGET_PAGE;
> > +			target->page = page;
> > +		}
> > +		if (!ret)
> > +			put_page(page);
> > +	}
> > +	/* fall throught */
> 
> should be through
> 
will fix.


Regards,
Daisuke Nishimura

> > +	if (ent.val && do_swap_account && !ret) {
> > +		/*
> > +		 * mem_cgroup_move_swap_account() checks the entry is valid or
> > +		 * not.
> > +		 */
> > +		ret = RECHARGE_TARGET_SWAP;
> > +		target->ent = ent;
> > +	}
> > 
> >  	return ret;
> >  }
> > @@ -3634,6 +3713,7 @@ retry:
> >  		int type;
> >  		struct page *page;
> >  		struct page_cgroup *pc;
> > +		swp_entry_t ent;
> > 
> >  		if (!recharge.precharge)
> >  			break;
> > @@ -3654,6 +3734,14 @@ retry:
> >  put:			/* is_target_pte_for_recharge() gets the page */
> >  			put_page(page);
> >  			break;
> > +		case RECHARGE_TARGET_SWAP:
> > +			ent = target.ent;
> > +			if (!mem_cgroup_move_swap_account(ent,
> > +						recharge.from, recharge.to)) {
> > +				css_put(&recharge.to->css);
> > +				recharge.precharge--;
> > +			}
> > +			break;
> >  		default:
> >  			break;
> >  		}
> > diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> > index 3d535d5..213b0ee 100644
> > --- a/mm/page_cgroup.c
> > +++ b/mm/page_cgroup.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/vmalloc.h>
> >  #include <linux/cgroup.h>
> >  #include <linux/swapops.h>
> > +#include <asm/cmpxchg.h>
> > 
> >  static void __meminit
> >  __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> > @@ -335,6 +336,37 @@ not_enough_page:
> >  }
> > 
> >  /**
> > + * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
> > + * @end: swap entry to be cmpxchged
> > + * @old: old id
> > + * @new: new id
> > + *
> > + * Returns old id at success, 0 at failure.
> > + * (There is no mem_cgroup useing 0 as its id)
> > + */
> > +unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> > +					unsigned short old, unsigned short new)
> > +{
> > +	int type = swp_type(ent);
> > +	unsigned long offset = swp_offset(ent);
> > +	unsigned long idx = offset / SC_PER_PAGE;
> > +	unsigned long pos = offset & SC_POS_MASK;
> > +	struct swap_cgroup_ctrl *ctrl;
> > +	struct page *mappage;
> > +	struct swap_cgroup *sc;
> > +
> > +	ctrl = &swap_cgroup_ctrl[type];
> > +
> > +	mappage = ctrl->map[idx];
> > +	sc = page_address(mappage);
> > +	sc += pos;
> > +	if (cmpxchg(&sc->id, old, new) == old)
> > +		return old;
> > +	else
> > +		return 0;
> > +}
> > +
> > +/**
> >   * swap_cgroup_record - record mem_cgroup for this swp_entry.
> >   * @ent: swap entry to be recorded into
> >   * @mem: mem_cgroup to be recorded
> > @@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
> >  	mappage = ctrl->map[idx];
> >  	sc = page_address(mappage);
> >  	sc += pos;
> > -	old = sc->id;
> > -	sc->id = id;
> > +	old = xchg(&sc->id, id);
> > 
> >  	return old;
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f32d716..cc2e859 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -719,6 +719,37 @@ int free_swap_and_cache(swp_entry_t entry)
> >  	return p != NULL;
> >  }
> > 
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/**
> > + * mem_cgroup_count_swap_user - count the user of a swap entry
> > + * @ent: the swap entry to be checked
> > + * @pagep: the pointer for the swap cache page of the entry to be stored
> > + *
> > + * Returns the number of the user of the swap entry. The number is valid only
> > + * for swaps of anonymous pages.
> > + * If the entry is found on swap cache, the page is stored to pagep with
> > + * refcount of it being incremented.
> > + */
> > +int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep)
> > +{
> > +	struct page *page;
> > +	struct swap_info_struct *p;
> > +	int count = 0;
> > +
> > +	page = find_get_page(&swapper_space, ent.val);
> > +	if (page)
> > +		count += page_mapcount(page);
> > +	p = swap_info_get(ent);
> > +	if (p) {
> > +		count += swap_count(p->swap_map[swp_offset(ent)]);
> > +		spin_unlock(&swap_lock);
> > +	}
> > +
> > +	*pagep = page;
> > +	return count;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_HIBERNATION
> >  /*
> >   * Find the swap type that corresponds to given device (if any).
> > -- 
> > 1.5.6.1
> > 
> 
> -- 
> 	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-11-24  2:43     ` Daisuke Nishimura
@ 2009-11-27  4:58       ` Daisuke Nishimura
  2009-12-03  4:58         ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-11-27  4:58 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

> > Sorry, if I missed it, but I did not see any time overhead of moving a
> > task after these changes. Could you please help me understand the cost
> > of moving say a task with 1G anonymous memory to another group and
> > the cost of moving a task with 512MB anonymous and 512 page cache
> > mapped, etc. It would be nice to understand the overall cost.
> > 
> O.K.
> I'll test programs with big anonymous pages and measure the time and report.
> 
I measured the elapsed time of "echo <pid> > <some path>/tasks" on KVM guest
with 4CPU/4GB(Xeon/3GHz).

- used the attached simple program.
- made 2 directories(00, 01) under root, and enabled recharge_at_immigrate in both.
- measured the elapsed time by "time -p" for moving between:

  (1) root -> 00
  (2) 00 -> 01

  we don't need to call res_counter_uncharge against root, so (1) would be smaller
  than (2).

  (3) 00(setting mem.limit to half size of total) -> 01

  To compare the overhead of anon and swap.

Results:

       |  252M  |  512M  |   1G
  -----+--------+--------+--------
   (1) |  0.21  |  0.41  |  0.821
  -----+--------+--------+--------
   (2) |  0.43  |  0.85  |  1.71
  -----+--------+--------+--------
   (3) |  0.40  |  0.81  |  1.62
  -----+--------+--------+--------


hmm, it would be better to add some comments to memory.txt like:

  Note: It may take several seconds if you move charges on the order of gigabytes.


Regards,
Daisuke Nishimura.


[-- Attachment #2: bigmem.c --]
[-- Type: text/x-csrc, Size: 660 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

void
usage(void)
{
	fprintf(stderr, "bigmem <anon size(MB)>\n");
}

int
main(int argc, char *argv[])
{
	void *buf;
	size_t size;
	pid_t pid;

	if (argc != 2) {
		usage();
		return 1;
	}

	pid = getpid();
	fprintf(stdout, "pid is %d\n", pid);

	size = atol(argv[1]) * 1024 * 1024;
	/* the fd is ignored for MAP_ANONYMOUS; -1 is the portable value */
	buf = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return errno;
	}

	/* touch every page so the whole mapping is really charged */
	memset(buf, 0, size);
	fprintf(stdout, "allocated %zu bytes of anonymous memory\n", size);

	/* stay alive so the process can be moved between cgroups */
	pause();

	return 0;
}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-11-27  4:58       ` Daisuke Nishimura
@ 2009-12-03  4:58         ` Daisuke Nishimura
  2009-12-03  5:22           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-12-03  4:58 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: linux-mm, Andrew Morton, Li Zefan, Paul Menage, Daisuke Nishimura

On Fri, 27 Nov 2009 13:58:10 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > Sorry, if I missed it, but I did not see any time overhead of moving a
> > > task after these changes. Could you please help me understand the cost
> > > of moving say a task with 1G anonymous memory to another group and
> > > the cost of moving a task with 512MB anonymous and 512 page cache
> > > mapped, etc. It would be nice to understand the overall cost.
> > > 
> > O.K.
> > I'll test programs with big anonymous pages and measure the time and report.
> > 
> I measured the elapsed time of "echo <pid> > <some path>/tasks" on KVM guest
> with 4CPU/4GB(Xeon/3GHz).
> 
> - used the attached simple program.
> - made 2 directories(00, 01) under root, and enabled recharge_at_immigrate in both.
> - measured the elapsed time by "time -p" for moving between:
> 
>   (1) root -> 00
>   (2) 00 -> 01
> 
>   we don't need to call res_counter_uncharge against root, so (1) would be smaller
>   than (2).
> 
>   (3) 00(setting mem.limit to half size of total) -> 01
> 
>   To compare the overhead of anon and swap.
> 
> Results:
> 
>        |  252M  |  512M  |   1G
>   -----+--------+--------+--------
>    (1) |  0.21  |  0.41  |  0.821
>   -----+--------+--------+--------
>    (2) |  0.43  |  0.85  |  1.71
>   -----+--------+--------+--------
>    (3) |  0.40  |  0.81  |  1.62
>   -----+--------+--------+--------
> 
I'm now trying to decrease these overheads as much as possible, and the current
status is below.

(support for moving swap charges has not been pushed to my tree yet, so I tested
only cases (1) and (2).)

       |  252M  |  512M  |   1G
  -----+--------+--------+--------
   (1) |  0.20  |  0.40  |  0.81
  -----+--------+--------+--------
   (2) |  0.20  |  0.40  |  0.81

What I've done is:
- Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account()
  every time, call res_counter_uncharge(PAGE_SIZE * moved) at the end of task migration once
  (a rough sketch of this batching follows below).
- Instead of calling try_charge repeatedly, call res_counter_charge(PAGE_SIZE * necessary)
  in can_attach() if possible.
- Not only res_counter_charge/uncharge, consolidate css_get()/put() too.
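
A rough sketch of the batching in the first item (just an illustration with a
made-up function name, not the patch I'll post):

/*
 * Called once from the attach path after all pages have been moved,
 * instead of uncharging "from" page by page in move_account().
 */
static void mem_cgroup_uncharge_moved(struct mem_cgroup *from,
				      unsigned long moved)
{
	if (!moved)
		return;
	if (!mem_cgroup_is_root(from))
		res_counter_uncharge(&from->res, PAGE_SIZE * moved);
	/* the memsw counter can be handled the same way when needed */
}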


BTW, KAMEZAWA-san, are you planning to add mm_counter for swap yet ?
To tell the truth, instead of making use of mm_counter, I want to parse the page table
in can_attach as I did before (a rough sketch follows below), because:
- parsing the page table in can_attach seems not to add so big overheads(see below).
- if we add support for file-cache and shmem in future, I think we need to parse the page table
  anyway, because there is no independent mm_counter for shmem. I want to treat them
  independently because users don't consider shmem as file-cache, IMHO.
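
For illustration, the walk could be shaped like this (just a sketch;
count_target_ptes() is a made-up stand-in for the real pte check, e.g. the
is_target_pte_for_recharge() loop):

/* hypothetical helper: counts the ptes in one vma that would be moved */
static unsigned long count_target_ptes(struct vm_area_struct *vma);

/*
 * Walk the task's vmas and count how many charges would move, so that
 * can_attach() knows how much to precharge without a dedicated mm_counter.
 */
static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	unsigned long precharge = 0;

	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (is_vm_hugetlb_page(vma))
			continue;	/* hugetlb is not a target */
		precharge += count_target_ptes(vma);
	}
	up_read(&mm->mmap_sem);

	return precharge;
}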

(parsing the page table in can_attach)
       |  252M  |  512M  |   1G
  -----+--------+--------+--------
   (1) |  0.21  |  0.41  |  0.83
  -----+--------+--------+--------
   (2) |  0.21  |  0.41  |  0.83

Hopefully, I can post a new version this week.


Regards,
Daisuke Nishimura.



* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-12-03  4:58         ` Daisuke Nishimura
@ 2009-12-03  5:22           ` KAMEZAWA Hiroyuki
  2009-12-03  6:00             ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-03  5:22 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Balbir Singh, linux-mm, Andrew Morton, Li Zefan, Paul Menage

On Thu, 3 Dec 2009 13:58:05 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> I'm now trying to decrease this overhead as much as possible, and the current
> status is below.
> 
thanks.

> (support for moving swap charge has not been pushed yet in my tree, so I tested
> only (1) and (2) cases.)
> 
>        |  252M  |  512M  |   1G
>   -----+--------+--------+--------
>    (1) |  0.20  |  0.40  |  0.81
>   -----+--------+--------+--------
>    (2) |  0.20  |  0.40  |  0.81
> 
What is the unit of each number? Seconds? And migration of a process with 1GB
requires 0.8 sec? But, hmm, a 2x speedup! Sounds nice.


> What I've done is:
> - Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account()
>   every time, call res_counter_uncharge(PAGE_SIZE * moved) once at the end of task migration.
sounds reasonable.

> - Instead of calling try_charge repeatedly, call res_counter_charge(PAGE_SIZE * necessary)
>   in can_attach() if possible.
sounds reasonable, too.

> - Not only res_counter_charge/uncharge, consolidate css_get()/put() too.
> 
please do. But, hmm, I'd like to remove css_put/get per page ;) But I'll put that aside for now.

> 
> BTW, KAMEZAWA-san, are you still planning to add an mm_counter for swap?
yes. please see my newest patch ;) an extreme one: http://marc.info/?l=linux-mm&m=125980393923228&w=2

> To tell the truth, instead of making use of mm_counter, I want to parse the page table
> in can_attach as I did before, because:
> - parsing the page table in can_attach seems not to add such a big overhead (see below).
ok.

> - if we add support for file cache and shmem in the future, I think we need to parse the page table
>   anyway, because there is no independent mm_counter for shmem. I want to treat them
>   independently because users don't consider shmem to be file cache, IMHO.
> 
ok, about scanning page tables:
Moving 1GB means moving 262144 pages and scanning 128 page tables. Maybe not a very big cost.

I still doubt that moving "shared" pages "silently" is useful, but that's another topic here.

> (parsing the page table in can_attach)
>        |  252M  |  512M  |   1G
>   -----+--------+--------+--------
>    (1) |  0.21  |  0.41  |  0.83
>   -----+--------+--------+--------
>    (2) |  0.21  |  0.41  |  0.83
> 
> I hope to post a new version this week.
> 

Thank you for your efforts.

Thanks,
-Kame




* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-12-03  5:22           ` KAMEZAWA Hiroyuki
@ 2009-12-03  6:00             ` Daisuke Nishimura
  2009-12-03  7:40               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-12-03  6:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Andrew Morton, Li Zefan, Paul Menage,
	Daisuke Nishimura

On Thu, 3 Dec 2009 14:22:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 3 Dec 2009 13:58:05 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > I'm now trying to decrease this overhead as much as possible, and the current
> > status is below.
> > 
> thanks.
> 
> > (support for moving swap charge has not been pushed yet in my tree, so I tested
> > only (1) and (2) cases.)
> > 
> >        |  252M  |  512M  |   1G
> >   -----+--------+--------+--------
> >    (1) |  0.20  |  0.40  |  0.81
> >   -----+--------+--------+--------
> >    (2) |  0.20  |  0.40  |  0.81
> > 
> What is the unit of each number? Seconds? And migration of a process with 1GB
> requires 0.8 sec? But, hmm, a 2x speedup! Sounds nice.
> 
Ah, these numbers are in seconds.
I agree they are still big...

> 
> > What I've done is:
> > - Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account()
> >   every time, call res_counter_uncharge(PAGE_SIZE * moved) once at the end of task migration.
> sounds reasonable.
> 
> > - Instead of calling try_charge repeatedly, call res_counter_charge(PAGE_SIZE * necessary)
> >   in can_attach() if possible.
> sounds reasonable, too.
> 
> > - Not only res_counter_charge/uncharge, consolidate css_get()/put() too.
> > 
> please do. But, hmm, I'd like to remove css_put/get per page ;) But I'll put that aside for now.
> 
I do agree with you, but removing them would be a big change.
This change reduced the time by about 0.2 sec in the 1GB case, so it's a workaround for now.


Thanks,
Daisuke Nishimura.

> > 
> > BTW, KAMEZAWA-san, are you still planning to add an mm_counter for swap?
> yes. please see my newest patch ;) an extreme one: http://marc.info/?l=linux-mm&m=125980393923228&w=2
> 
> > To tell the truth, instead of making use of mm_counter, I want to parse the page table
> > in can_attach as I did before, because:
> > - parsing the page table in can_attach seems not to add such a big overhead (see below).
> ok.
> 
> > - if we add support for file cache and shmem in the future, I think we need to parse the page table
> >   anyway, because there is no independent mm_counter for shmem. I want to treat them
> >   independently because users don't consider shmem to be file cache, IMHO.
> > 
> ok, about scanning page tables:
> Moving 1GB means moving 262144 pages and scanning 128 page tables. Maybe not a very big cost.
> 
> I still doubt that moving "shared" pages "silently" is useful, but that's another topic here.
> 
> > (parsing the page table in can_attach)
> >        |  252M  |  512M  |   1G
> >   -----+--------+--------+--------
> >    (1) |  0.21  |  0.41  |  0.83
> >   -----+--------+--------+--------
> >    (2) |  0.21  |  0.41  |  0.83
> > 
> > I hope to post a new version this week.
> > 
> 
> Thank you for your efforts.
> 
> Thanks,
> -Kame
> 
> 



* Re: [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move
  2009-12-03  6:00             ` Daisuke Nishimura
@ 2009-12-03  7:40               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-03  7:40 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Balbir Singh, linux-mm, Andrew Morton, Li Zefan, Paul Menage

On Thu, 3 Dec 2009 15:00:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 3 Dec 2009 14:22:43 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 3 Dec 2009 13:58:05 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > I'm now trying to decrease this overhead as much as possible, and the current
> > > status is below.
> > > 
> > thanks.
> > 
> > > (support for moving swap charge has not been pushed yet in my tree, so I tested
> > > only (1) and (2) cases.)
> > > 
> > >        |  252M  |  512M  |   1G
> > >   -----+--------+--------+--------
> > >    (1) |  0.20  |  0.40  |  0.81
> > >   -----+--------+--------+--------
> > >    (2) |  0.20  |  0.40  |  0.81
> > > 
> > What is the unit of each number? Seconds? And migration of a process with 1GB
> > requires 0.8 sec? But, hmm, a 2x speedup! Sounds nice.
> > 
> Ah, these numbers are in seconds.
> I agree they are still big...
> 
But reducing this further will probably require a big effort (or may be impossible).
So, I think this number is not too bad.


> > 
> > > What I've done is:
> > > - Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account()
> > >   every time, call res_counter_uncharge(PAGE_SIZE * moved) once at the end of task migration.
> > sounds reasonable.
> > 
> > > - Instead of calling try_charge repeatedly, call res_counter_charge(PAGE_SIZE * necessary)
> > >   in can_attach() if possible.
> > sounds reasonable, too.
> > 
> > > - Not only res_counter_charge/uncharge, consolidate css_get()/put() too.
> > > 
> > please do. But, hmm, I'd like to remove css_put/get per page ;) But I'll put that aside for now.
> > 
> I do agree with you, but removing them would be a big change.
> This change reduced the time by about 0.2 sec in the 1GB case, so it's a workaround for now.
> 
please go ahead with the coalesced css_put()/get().
I agree that the workaround is necessary for now.

Thanks,
-Kame





Thread overview: 20+ messages
2009-11-19  4:27 [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Daisuke Nishimura
2009-11-19  4:28 ` [PATCH -mmotm 1/5] cgroup: introduce cancel_attach() Daisuke Nishimura
2009-11-19 21:42   ` Paul Menage
2009-11-19 23:49     ` Daisuke Nishimura
2009-11-19  4:29 ` [PATCH -mmotm 2/5] memcg: add interface to recharge at task move Daisuke Nishimura
2009-11-20 15:42   ` Balbir Singh
2009-11-23 23:56     ` Daisuke Nishimura
2009-11-19  4:29 ` [PATCH -mmotm 3/5] memcg: recharge charges of anonymous page Daisuke Nishimura
2009-11-19  4:30 ` [PATCH -mmotm 4/5] memcg: avoid oom during recharge at task move Daisuke Nishimura
2009-11-23  5:10   ` Balbir Singh
2009-11-24  2:43     ` Daisuke Nishimura
2009-11-27  4:58       ` Daisuke Nishimura
2009-12-03  4:58         ` Daisuke Nishimura
2009-12-03  5:22           ` KAMEZAWA Hiroyuki
2009-12-03  6:00             ` Daisuke Nishimura
2009-12-03  7:40               ` KAMEZAWA Hiroyuki
2009-11-19  4:31 ` [PATCH -mmotm 5/5] memcg: recharge charges of anonymous swap Daisuke Nishimura
2009-11-23  6:59   ` Balbir Singh
2009-11-24  7:54     ` Daisuke Nishimura
2009-11-19 19:03 ` [PATCH -mmotm 0/5] memcg: recharge at task move (19/Nov) Balbir Singh
