* [BUGFIX][PATCH] memcg: fix race in file_mapped accounting flag management
@ 2010-09-13  7:08 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13  7:08 UTC (permalink / raw)
  To: linux-mm; +Cc: balbir, nishimura, gthelen, linux-kernel, akpm, stable


I think this small race is not very critical, but it is a bug.
We have had this race since 2.6.34.
=
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, the memory cgroup accounts file-mapped pages with both a counter and a flag.
The counter works in the same way as zone_stat, but the FileMapped flag only
exists in memcg (to help move_account).

This flag can be updated incorrectly in one case. Assume CPU0 and CPU1,
with one thread mapping a page on CPU0 and another thread unmapping it on CPU1.

    CPU0                   		CPU1
				rmv rmap (mapcount 1->0)
   add rmap (mapcount 0->1)
   lock_page_cgroup()
   memcg counter+1		(some delay)
   set MAPPED FLAG.
   unlock_page_cgroup()
				lock_page_cgroup()
				memcg counter-1
				clear MAPPED flag

In the above sequence, the counter is updated properly but the FLAG is not.
This means that representing a state by a flag which shadows a counter
needs some special care.

To handle this, when clearing the flag, this patch checks mapcount directly and
clears the flag only when mapcount == 0. (If mapcount > 0, someone will bring
it to zero later and the flag will be cleared then.)

The reverse case, dec-after-inc, cannot be a problem because page_table_lock()
covers it. (IOW, to produce the above sequence, two processes have to touch
the same page at once with map/unmap.)
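
For illustration only (this is not part of the patch), the clear side of
mem_cgroup_update_file_mapped() behaves roughly as below after the fix; the
PageCgroupUsed check and the set side are omitted:

	lock_page_cgroup(pc);
	...
	__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
	/*
	 * Another CPU may already have re-mapped the page and set the flag
	 * again before our decrement ran (the inc-before-dec ordering above).
	 * Only clear the flag when the page really has no mappings left;
	 * if mapcount > 0, whoever drops it to 0 will clear the flag later.
	 */
	if (!page_mapped(page))
		ClearPageCgroupFileMapped(pc);
	...
	unlock_page_cgroup(pc);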

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: lockless-update/mm/memcontrol.c
===================================================================
--- lockless-update.orig/mm/memcontrol.c
+++ lockless-update/mm/memcontrol.c
@@ -1485,7 +1485,8 @@ void mem_cgroup_update_file_mapped(struc
 		SetPageCgroupFileMapped(pc);
 	} else {
 		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		ClearPageCgroupFileMapped(pc);
+		if (!page_mapped(page)) /* for race between dec->inc counter */
+			ClearPageCgroupFileMapped(pc);
 	}
 
 done:


^ permalink raw reply	[flat|nested] 22+ messages in thread


* [PATCH] memcg: fix race in file_mapped accounting flag management
  2010-09-13  7:08 ` KAMEZAWA Hiroyuki
@ 2010-09-13  7:13   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13  7:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, gthelen, linux-kernel, akpm

Based on mmotm + the fix-race-in-file_mapped-accounting-flag-management patch.
This patch is not a bug fix but a performance improvement.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When accounting file events per memory cgroup, we need to find the memory
cgroup via page_cgroup->mem_cgroup. Currently, we use lock_page_cgroup() to
guarantee that pc->mem_cgroup is not overwritten while we make use of it.

But considering the contexts in which the page_cgroup of a file page is
accessed, we can use an alternative light-weight mutual exclusion in most
cases.

When handling file caches, the only race we have to take care of is account
"moving", IOW, overwriting page_cgroup->mem_cgroup.
(See the comment in the patch.)

Unlike charge/uncharge, "move" does not happen frequently. It happens only at
rmdir() and at task moving (with special settings).
This patch adds a race checker for file-cache-status accounting vs. account
moving. A new per-cpu, per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move:
  1. Increments it before starting the move.
  2. Calls synchronize_rcu().
  3. Decrements it after the move ends.
With this, the file-status-counting routine can check whether it needs to call
lock_page_cgroup(). In most cases, it doesn't need to.
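
A condensed sketch of that pairing (illustration only; the real code is in the
patch below and uses mc.lock and the per-cpu memcg stat array):

	/* account-move side (writer) */
	spin_lock(&mc.lock);
	for_each_possible_cpu(cpu)	/* mark "a move may be in progress" */
		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
	spin_unlock(&mc.lock);
	synchronize_rcu();		/* wait until every reader sees the mark */
	/* ... move accounts, then decrement MEM_CGROUP_ON_MOVE the same way ... */

	/* file-stat update side (reader) */
	rcu_read_lock();
	mem = pc->mem_cgroup;
	if (this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0) {
		/* a move may overwrite pc->mem_cgroup: fall back to the lock */
		lock_page_cgroup(pc);
		need_unlock = true;
		mem = pc->mem_cgroup;
	}
	/* ... update MEM_CGROUP_STAT_FILE_MAPPED and the flag via "mem" ... */
	if (need_unlock)
		unlock_page_cgroup(pc);
	rcu_read_unlock();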

Following is perf data from a process which mmap()/munmap()s 32MB of file
cache in a minute.

Before patch:
    28.25%     mmap  mmap               [.] main
    22.64%     mmap  [kernel.kallsyms]  [k] page_fault
     9.96%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.67%     mmap  [kernel.kallsyms]  [k] filemap_fault
     3.50%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     2.99%     mmap  [kernel.kallsyms]  [k] __do_fault
     2.76%     mmap  [kernel.kallsyms]  [k] find_get_page

After patch:
    30.00%     mmap  mmap               [.] main
    23.78%     mmap  [kernel.kallsyms]  [k] page_fault
     5.52%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.81%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     3.26%     mmap  [kernel.kallsyms]  [k] find_get_page
     3.18%     mmap  [kernel.kallsyms]  [k] __do_fault
     3.03%     mmap  [kernel.kallsyms]  [k] filemap_fault
     2.40%     mmap  [kernel.kallsyms]  [k] handle_mm_fault
     2.40%     mmap  [kernel.kallsyms]  [k] do_page_fault

This patch reduces memcg's cost to some extent.
(mem_cgroup_update_file_mapped() is called by both map and unmap.)

Note: It seems some more improvements are possible, but I have no idea yet;
      maybe removing the set/clear of the flag is required.

Changelog: 20100913
 - decoupled from the ID patches.
 - updated comments.

Changelog: 20100901
 - changed id_to_memcg(pc, true) to id_to_memcg(pc, false)
   in update_file_mapped()
 - updated comments on lock rule of update_file_mapped()
Changelog: 20100825
 - added a comment about mc.lock
 - fixed bad lock.
Changelog: 20100804
 - added a comment for possible optimization hint.
Changelog: 20100730
 - some cleanup.
Changelog: 20100729
 - replaced __this_cpu_xxx() with this_cpu_xxx()
   (because we are not under a spinlock there)
 - added VM_BUG_ON().

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 85 insertions(+), 14 deletions(-)

Index: lockless-update/mm/memcontrol.c
===================================================================
--- lockless-update.orig/mm/memcontrol.c
+++ lockless-update/mm/memcontrol.c
@@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1051,7 +1052,46 @@ static unsigned int get_swappiness(struc
 	return swappiness;
 }
 
-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+	int cpu;
+	/* Because this is for moving account, reuse mc.lock */
+	spin_lock(&mc.lock);
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+	spin_unlock(&mc.lock);
+
+	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+	int cpu;
+
+	if (!mem)
+		return;
+	spin_lock(&mc.lock);
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+	spin_unlock(&mc.lock);
+}
+/*
+ * 2 routines for checking "mem" is under move_account() or not.
+ *
+ * mem_cgroup_stealed() - checks whether a cgroup is mc.from or not. This is
+ *			  used for avoiding races in accounting. If true,
+ *			  pc->mem_cgroup may be overwritten.
+ *
+ * mem_cgroup_under_move() - checks whether a cgroup is mc.from or mc.to or
+ *			  under the hierarchy of moving cgroups. This is used
+ *			  for waiting at high memory pressure caused by "move".
+ */
+
+static bool mem_cgroup_stealed(struct mem_cgroup *mem)
+{
+	VM_BUG_ON(!rcu_read_lock_held());
+	return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}
 
 static bool mem_cgroup_under_move(struct mem_cgroup *mem)
 {
@@ -1462,35 +1502,62 @@ bool mem_cgroup_handle_oom(struct mem_cg
 /*
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
+ *
+ * Notes: Race condition
+ *
+ * We usually use lock_page_cgroup() for accessing page_cgroup members, but
+ * it tends to be costly. Considering some conditions, we don't need to do
+ * so _always_.
+ *
+ * Considering "charge", lock_page_cgroup() is not required because all
+ * file-stat operations happen after a page is attached to the radix-tree.
+ * There is no race with "charge".
+ *
+ * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
+ * at "uncharge" intentionally. So, we always see a valid pc->mem_cgroup even
+ * if there is a race with "uncharge". The statistics themselves are properly
+ * handled by the flags.
+ *
+ * Considering "move", this is the only case where we see a race. To make the
+ * race window small, we check the MEM_CGROUP_ON_MOVE percpu value and detect
+ * the possibility of a race condition. If there is one, we take the lock.
  */
 void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+	bool need_unlock = false;
 
-	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	rcu_read_lock();
 	mem = pc->mem_cgroup;
-	if (!mem || !PageCgroupUsed(pc))
-		goto done;
-
-	/*
-	 * Preemption is already disabled. We can use __this_cpu_xxx
-	 */
+	if (unlikely(!mem || !PageCgroupUsed(pc)))
+		goto out;
+	/* pc->mem_cgroup is unstable ? */
+	if (unlikely(mem_cgroup_stealed(mem))) {
+		/* take lock_page_cgroup() to access pc->mem_cgroup safely */
+		lock_page_cgroup(pc);
+		need_unlock = true;
+		mem = pc->mem_cgroup;
+		if (!mem || !PageCgroupUsed(pc))
+			goto out;
+	}
 	if (val > 0) {
-		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		SetPageCgroupFileMapped(pc);
 	} else {
-		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		if (!page_mapped(page)) /* for race between dec->inc counter */
 			ClearPageCgroupFileMapped(pc);
 	}
 
-done:
-	unlock_page_cgroup(pc);
+out:
+	if (unlikely(need_unlock))
+		unlock_page_cgroup(pc);
+	rcu_read_unlock();
+	return;
 }
 
 /*
@@ -3039,6 +3106,7 @@ move_account:
 		lru_add_drain_all();
 		drain_all_stock_sync();
 		ret = 0;
+		mem_cgroup_start_move(mem);
 		for_each_node_state(node, N_HIGH_MEMORY) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
 				enum lru_list l;
@@ -3052,6 +3120,7 @@ move_account:
 			if (ret)
 				break;
 		}
+		mem_cgroup_end_move(mem);
 		memcg_oom_recover(mem);
 		/* it seems parent cgroup doesn't have enough mem */
 		if (ret == -ENOMEM)
@@ -4510,6 +4579,7 @@ static void mem_cgroup_clear_mc(void)
 	mc.to = NULL;
 	mc.moving_task = NULL;
 	spin_unlock(&mc.lock);
+	mem_cgroup_end_move(from);
 	memcg_oom_recover(from);
 	memcg_oom_recover(to);
 	wake_up_all(&mc.waitq);
@@ -4540,6 +4610,7 @@ static int mem_cgroup_can_attach(struct 
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
 			VM_BUG_ON(mc.moving_task);
+			mem_cgroup_start_move(from);
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = mem;



^ permalink raw reply	[flat|nested] 22+ messages in thread


* [PATCH] memcg: avoid lock in updating file_mapped (Was fix race in file_mapped accounting flag management
  2010-09-13  7:13   ` KAMEZAWA Hiroyuki
@ 2010-09-13  8:01     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, gthelen, linux-kernel, akpm


Very sorry, the subject was wrong (reposting).

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When accounting file events per memory cgroup, we need to find the memory
cgroup via page_cgroup->mem_cgroup. Currently, we use lock_page_cgroup() to
guarantee that pc->mem_cgroup is not overwritten while we make use of it.

But considering the contexts in which the page_cgroup of a file page is
accessed, we can use an alternative light-weight mutual exclusion in most
cases.

When handling file caches, the only race we have to take care of is account
"moving", IOW, overwriting page_cgroup->mem_cgroup.
(See the comment in the patch.)

Unlike charge/uncharge, "move" does not happen frequently. It happens only at
rmdir() and at task moving (with special settings).
This patch adds a race checker for file-cache-status accounting vs. account
moving. A new per-cpu, per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move:
  1. Increments it before starting the move.
  2. Calls synchronize_rcu().
  3. Decrements it after the move ends.
With this, the file-status-counting routine can check whether it needs to call
lock_page_cgroup(). In most cases, it doesn't need to.

Following is perf data from a process which mmap()/munmap()s 32MB of file
cache in a minute.

Before patch:
    28.25%     mmap  mmap               [.] main
    22.64%     mmap  [kernel.kallsyms]  [k] page_fault
     9.96%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.67%     mmap  [kernel.kallsyms]  [k] filemap_fault
     3.50%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     2.99%     mmap  [kernel.kallsyms]  [k] __do_fault
     2.76%     mmap  [kernel.kallsyms]  [k] find_get_page

After patch:
    30.00%     mmap  mmap               [.] main
    23.78%     mmap  [kernel.kallsyms]  [k] page_fault
     5.52%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
     3.81%     mmap  [kernel.kallsyms]  [k] unmap_vmas
     3.26%     mmap  [kernel.kallsyms]  [k] find_get_page
     3.18%     mmap  [kernel.kallsyms]  [k] __do_fault
     3.03%     mmap  [kernel.kallsyms]  [k] filemap_fault
     2.40%     mmap  [kernel.kallsyms]  [k] handle_mm_fault
     2.40%     mmap  [kernel.kallsyms]  [k] do_page_fault

This patch reduces memcg's cost to some extent.
(mem_cgroup_update_file_mapped() is called by both map and unmap.)

Note: It seems some more improvements are possible, but I have no idea yet;
      maybe removing the set/clear of the flag is required.

Changelog: 20100913
 - decoupled from the ID patches.
 - updated comments.

Changelog: 20100901
 - changed id_to_memcg(pc, true) to id_to_memcg(pc, false)
   in update_file_mapped()
 - updated comments on lock rule of update_file_mapped()
Changelog: 20100825
 - added a comment about mc.lock
 - fixed bad lock.
Changelog: 20100804
 - added a comment for possible optimization hint.
Changelog: 20100730
 - some cleanup.
Changelog: 20100729
 - replaced __this_cpu_xxx() with this_cpu_xxx()
   (because we are not under a spinlock there)
 - added VM_BUG_ON().

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 85 insertions(+), 14 deletions(-)

Index: lockless-update/mm/memcontrol.c
===================================================================
--- lockless-update.orig/mm/memcontrol.c
+++ lockless-update/mm/memcontrol.c
@@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1051,7 +1052,46 @@ static unsigned int get_swappiness(struc
 	return swappiness;
 }
 
-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+	int cpu;
+	/* Because this is for moving account, reuse mc.lock */
+	spin_lock(&mc.lock);
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+	spin_unlock(&mc.lock);
+
+	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+	int cpu;
+
+	if (!mem)
+		return;
+	spin_lock(&mc.lock);
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+	spin_unlock(&mc.lock);
+}
+/*
+ * 2 routines for checking "mem" is under move_account() or not.
+ *
+ * mem_cgroup_stealed() - checks whether a cgroup is mc.from or not. This is
+ *			  used for avoiding races in accounting. If true,
+ *			  pc->mem_cgroup may be overwritten.
+ *
+ * mem_cgroup_under_move() - checks whether a cgroup is mc.from or mc.to or
+ *			  under the hierarchy of moving cgroups. This is used
+ *			  for waiting at high memory pressure caused by "move".
+ */
+
+static bool mem_cgroup_stealed(struct mem_cgroup *mem)
+{
+	VM_BUG_ON(!rcu_read_lock_held());
+	return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}
 
 static bool mem_cgroup_under_move(struct mem_cgroup *mem)
 {
@@ -1462,35 +1502,62 @@ bool mem_cgroup_handle_oom(struct mem_cg
 /*
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
+ *
+ * Notes: Race condition
+ *
+ * We usually use lock_page_cgroup() for accessing page_cgroup members, but
+ * it tends to be costly. Considering some conditions, we don't need to do
+ * so _always_.
+ *
+ * Considering "charge", lock_page_cgroup() is not required because all
+ * file-stat operations happen after a page is attached to the radix-tree.
+ * There is no race with "charge".
+ *
+ * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
+ * at "uncharge" intentionally. So, we always see a valid pc->mem_cgroup even
+ * if there is a race with "uncharge". The statistics themselves are properly
+ * handled by the flags.
+ *
+ * Considering "move", this is the only case where we see a race. To make the
+ * race window small, we check the MEM_CGROUP_ON_MOVE percpu value and detect
+ * the possibility of a race condition. If there is one, we take the lock.
  */
 void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+	bool need_unlock = false;
 
-	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	rcu_read_lock();
 	mem = pc->mem_cgroup;
-	if (!mem || !PageCgroupUsed(pc))
-		goto done;
-
-	/*
-	 * Preemption is already disabled. We can use __this_cpu_xxx
-	 */
+	if (unlikely(!mem || !PageCgroupUsed(pc)))
+		goto out;
+	/* pc->mem_cgroup is unstable ? */
+	if (unlikely(mem_cgroup_stealed(mem))) {
+		/* take lock_page_cgroup() to access pc->mem_cgroup safely */
+		lock_page_cgroup(pc);
+		need_unlock = true;
+		mem = pc->mem_cgroup;
+		if (!mem || !PageCgroupUsed(pc))
+			goto out;
+	}
 	if (val > 0) {
-		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		SetPageCgroupFileMapped(pc);
 	} else {
-		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		if (!page_mapped(page)) /* for race between dec->inc counter */
 			ClearPageCgroupFileMapped(pc);
 	}
 
-done:
-	unlock_page_cgroup(pc);
+out:
+	if (unlikely(need_unlock))
+		unlock_page_cgroup(pc);
+	rcu_read_unlock();
+	return;
 }
 
 /*
@@ -3039,6 +3106,7 @@ move_account:
 		lru_add_drain_all();
 		drain_all_stock_sync();
 		ret = 0;
+		mem_cgroup_start_move(mem);
 		for_each_node_state(node, N_HIGH_MEMORY) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
 				enum lru_list l;
@@ -3052,6 +3120,7 @@ move_account:
 			if (ret)
 				break;
 		}
+		mem_cgroup_end_move(mem);
 		memcg_oom_recover(mem);
 		/* it seems parent cgroup doesn't have enough mem */
 		if (ret == -ENOMEM)
@@ -4510,6 +4579,7 @@ static void mem_cgroup_clear_mc(void)
 	mc.to = NULL;
 	mc.moving_task = NULL;
 	spin_unlock(&mc.lock);
+	mem_cgroup_end_move(from);
 	memcg_oom_recover(from);
 	memcg_oom_recover(to);
 	wake_up_all(&mc.waitq);
@@ -4540,6 +4610,7 @@ static int mem_cgroup_can_attach(struct 
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
 			VM_BUG_ON(mc.moving_task);
+			mem_cgroup_start_move(from);
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = mem;


^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accounting flag management
  2010-09-13  7:08 ` KAMEZAWA Hiroyuki
@ 2010-09-13  8:47   ` Balbir Singh
  -1 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2010-09-13  8:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, nishimura, gthelen, linux-kernel, akpm, stable

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-09-13 16:08:22]:

> 
> I think this small race is not very critical but it's bug.
> We have this race since 2.6.34. 
> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now. memory cgroup accounts file-mapped by counter and flag.
> counter is working in the same way with zone_stat but FileMapped flag only
> exists in memcg (for helping move_account).
> 
> This flag can be updated wrongly in a case. Assume CPU0 and CPU1
> and a thread mapping a page on CPU0, another thread unmapping it on CPU1.
> 
>     CPU0                   		CPU1
> 				rmv rmap (mapcount 1->0)
>    add rmap (mapcount 0->1) 
>    lock_page_cgroup()
>    memcg counter+1		(some delay)
>    set MAPPED FLAG.
>    unlock_page_cgroup()
> 				lock_page_cgroup()
> 				memcg counter-1
> 				clear MAPPED flag
> 
> In above sequence, counter is properly updated but FLAG is not.
> This means that representing a state by a flag which is maintained by
> counter needs some specail care.

In the situation above, who has the PTE lock? Are we not synchronized
via the PTE lock, such that add rmap and rm rmap will not happen
simultaneously?

> 
> To handle this, at claering a flag, this patch check mapcount directly and
                     ^^^^ (clearing)
> clear the flag only when mapcount == 0. (if mapcount >0, someone will make
> it to zero later and flag will be cleared.)
> 
> Reverse case, dec-after-inc cannot be a problem because page_table_lock()
> works well for it. (IOW, to make above sequence, 2 processes should touch
> the same page at once with map/unmap.)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accounting flag management
  2010-09-13  8:47   ` Balbir Singh
@ 2010-09-13 15:28     ` Hiroyuki Kamezawa
  -1 siblings, 0 replies; 22+ messages in thread
From: Hiroyuki Kamezawa @ 2010-09-13 15:28 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura, gthelen, linux-kernel,
	akpm, stable

2010/9/13 Balbir Singh <balbir@linux.vnet.ibm.com>:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-09-13 16:08:22]:
>
>>
>> I think this small race is not very critical but it's bug.
>> We have this race since 2.6.34.
>> =
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Now. memory cgroup accounts file-mapped by counter and flag.
>> counter is working in the same way with zone_stat but FileMapped flag only
>> exists in memcg (for helping move_account).
>>
>> This flag can be updated wrongly in a case. Assume CPU0 and CPU1
>> and a thread mapping a page on CPU0, another thread unmapping it on CPU1.
>>
>>     CPU0                              CPU1
>>                               rmv rmap (mapcount 1->0)
>>    add rmap (mapcount 0->1)
>>    lock_page_cgroup()
>>    memcg counter+1            (some delay)
>>    set MAPPED FLAG.
>>    unlock_page_cgroup()
>>                               lock_page_cgroup()
>>                               memcg counter-1
>>                               clear MAPPED flag
>>
>> In above sequence, counter is properly updated but FLAG is not.
>> This means that representing a state by a flag which is maintained by
>> counter needs some specail care.
>
> In the situation above who has the PTE lock? Are we not synchronized
> via the PTE lock such that add rmap and rm rmap, will not happen
> simultaneously?
>
In this case, the process doing the map and the one doing the unmap can be different.

Assume process A has a file-cache page mapped and process B does not.
While process A unmaps the page, process B can map it. The two use
different page tables, so the pte lock is no help.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accounting flag management
  2010-09-13 15:28     ` Hiroyuki Kamezawa
@ 2010-09-13 17:17       ` Balbir Singh
  -1 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2010-09-13 17:17 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura, gthelen, linux-kernel,
	akpm, stable

* Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> [2010-09-14 00:28:30]:

> > In the situation above who has the PTE lock? Are we not synchronized
> > via the PTE lock such that add rmap and rm rmap, will not happen
> > simultaneously?
> >
> In this case, a process for map and one for unmap can be different.
> 
> Assume process A maps a file cache and process B not.
> While process A unmap a file, process B can map it.
> pte lock is no help.
>

Correct, so while the accounting is correct, the flag can definitely
go wrong. I misread your race description earlier.

Thanks! 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread


* Re: [PATCH] memcg: avoid lock in updating file_mapped (Was fix race in file_mapped accounting flag management
  2010-09-13  8:01     ` KAMEZAWA Hiroyuki
@ 2010-09-13 17:26       ` Balbir Singh
  -1 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2010-09-13 17:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, gthelen, linux-kernel, akpm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-09-13 17:01:51]:

> 
> Very sorry, subject was wrong..(reposting).
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> At accounting file events per memory cgroup, we need to find memory cgroup
> via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup() for guarantee
> pc->mem_cgroup is not overwritten while we make use of it.
> 
> But, considering the context which page-cgroup for files are accessed,
> we can use alternative light-weight mutual execusion in the most case.
> 
> At handling file-caches, the only race we have to take care of is "moving"
> account, IOW, overwriting page_cgroup->mem_cgroup. 
> (See comment in the patch)
> 
> Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> rmdir() and task-moving (with a special settings.)
> This patch adds a race-checker for file-cache-status accounting v.s. account
> moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> The routine for account move 
>   1. Increment it before start moving
>   2. Call synchronize_rcu()
>   3. Decrement it after the end of moving.
> By this, file-status-counting routine can check it needs to call
> lock_page_cgroup(). In most case, I doesn't need to call it.
> 
> Following is a perf data of a process which mmap()/munmap 32MB of file cache
> in a minute.
> 
> Before patch:
>     28.25%     mmap  mmap               [.] main
>     22.64%     mmap  [kernel.kallsyms]  [k] page_fault
>      9.96%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
>      3.67%     mmap  [kernel.kallsyms]  [k] filemap_fault
>      3.50%     mmap  [kernel.kallsyms]  [k] unmap_vmas
>      2.99%     mmap  [kernel.kallsyms]  [k] __do_fault
>      2.76%     mmap  [kernel.kallsyms]  [k] find_get_page
> 
> After patch:
>     30.00%     mmap  mmap               [.] main
>     23.78%     mmap  [kernel.kallsyms]  [k] page_fault
>      5.52%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
>      3.81%     mmap  [kernel.kallsyms]  [k] unmap_vmas
>      3.26%     mmap  [kernel.kallsyms]  [k] find_get_page
>      3.18%     mmap  [kernel.kallsyms]  [k] __do_fault
>      3.03%     mmap  [kernel.kallsyms]  [k] filemap_fault
>      2.40%     mmap  [kernel.kallsyms]  [k] handle_mm_fault
>      2.40%     mmap  [kernel.kallsyms]  [k] do_page_fault
> 
> This patch reduces memcg's cost to some extent.
> (mem_cgroup_update_file_mapped is called by both of map/unmap)
> 
> Note: It seems some more improvements are required..but no idea.
>       maybe removing set/unset flag is required.
> 
> Changelog: 20100913
>  - decoupled with ID patches.
>  - updated comments.
> 
> Changelog: 20100901
>  - changes id_to_memcg(pc, true) to be id_to_memcg(pc, false)
>    in update_file_mapped()
>  - updated comments on lock rule of update_file_mapped()
> Changelog: 20100825
>  - added a comment about mc.lock
>  - fixed bad lock.
> Changelog: 20100804
>  - added a comment for possible optimization hint.
> Changelog: 20100730
>  - some cleanup.
> Changelog: 20100729
>  - replaced __this_cpu_xxx() with this_cpu_xxx
>    (because we don't call spinlock)
>  - added VM_BUG_ON().
> 
> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 85 insertions(+), 14 deletions(-)
> 
> Index: lockless-update/mm/memcontrol.c
> ===================================================================
> --- lockless-update.orig/mm/memcontrol.c
> +++ lockless-update/mm/memcontrol.c
> @@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
> 
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1051,7 +1052,46 @@ static unsigned int get_swappiness(struc
>  	return swappiness;
>  }
> 
> -/* A routine for testing mem is not under move_account */
> +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> +{
> +	int cpu;
> +	/* Because this is for moving account, reuse mc.lock */
> +	spin_lock(&mc.lock);
> +	for_each_possible_cpu(cpu)

for_each_possible_cpu() might be too much, no?

I recommend we use a get_online_cpus()/put_online_cpus() pair
around the call and optimize.
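
For illustration only, one reading of that suggestion for the start-move side
could be the sketch below; it assumes CPUs that come online during a move get
their counter initialized elsewhere (e.g. a hotplug notifier), which the
posted patch avoids by touching all possible CPUs:

	get_online_cpus();		/* keep the online mask stable */
	spin_lock(&mc.lock);
	for_each_online_cpu(cpu)	/* skip CPUs that are not up */
		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
	spin_unlock(&mc.lock);
	put_online_cpus();

	synchronize_rcu();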

> +		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> +	spin_unlock(&mc.lock);
> +
> +	synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> +{
> +	int cpu;
> +
> +	if (!mem)
> +		return;
> +	spin_lock(&mc.lock);
> +	for_each_possible_cpu(cpu)
> +		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;

Same as above


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accouting flag management
  2010-09-13  7:08 ` KAMEZAWA Hiroyuki
@ 2010-09-13 21:08   ` Andrew Morton
  -1 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2010-09-13 21:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, gthelen, linux-kernel, stable

On Mon, 13 Sep 2010 16:08:22 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> I think this small race is not very critical but it's bug.
> We have this race since 2.6.34. 
> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now. memory cgroup accounts file-mapped by counter and flag.
> counter is working in the same way with zone_stat but FileMapped flag only
> exists in memcg (for helping move_account).
> 
> This flag can be updated wrongly in a case. Assume CPU0 and CPU1
> and a thread mapping a page on CPU0, another thread unmapping it on CPU1.
> 
>     CPU0                   		CPU1
> 				rmv rmap (mapcount 1->0)
>    add rmap (mapcount 0->1)
>    lock_page_cgroup()
>    memcg counter+1		(some delay)
>    set MAPPED FLAG.
>    unlock_page_cgroup()
> 				lock_page_cgroup()
> 				memcg counter-1
> 				clear MAPPED flag
> 
> In above sequence, counter is properly updated but FLAG is not.
> This means that representing a state by a flag which is maintained by
> counter needs some specail care.
> 
> To handle this, at claering a flag, this patch check mapcount directly and
> clear the flag only when mapcount == 0. (if mapcount >0, someone will make
> it to zero later and flag will be cleared.)
> 
> Reverse case, dec-after-inc cannot be a problem because page_table_lock()
> works well for it. (IOW, to make above sequence, 2 processes should touch
> the same page at once with map/unmap.)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: lockless-update/mm/memcontrol.c
> ===================================================================
> --- lockless-update.orig/mm/memcontrol.c
> +++ lockless-update/mm/memcontrol.c
> @@ -1485,7 +1485,8 @@ void mem_cgroup_update_file_mapped(struc
>  		SetPageCgroupFileMapped(pc);
>  	} else {
>  		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		ClearPageCgroupFileMapped(pc);
> +		if (page_mapped(page)) /* for race between dec->inc counter */
> +			ClearPageCgroupFileMapped(pc);
>  	}

This should be !page_mapped(), shouldn't it?

And your second patch _does_ have !page_mapped() here, which is why the
second patch didn't apply.
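
I.e., presumably the clear wants to be guarded like this (as in the
second patch):

	} else {
		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
		if (!page_mapped(page)) /* for race between dec->inc counter */
			ClearPageCgroupFileMapped(pc);
	}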

I tried to fix things up.  Please check.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accouting flag management
  2010-09-13 21:08   ` Andrew Morton
@ 2010-09-14  4:35     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-14  4:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, balbir, nishimura, gthelen, linux-kernel, stable

On Mon, 13 Sep 2010 14:08:03 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 13 Sep 2010 16:08:22 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
<snip>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > Index: lockless-update/mm/memcontrol.c
> > ===================================================================
> > --- lockless-update.orig/mm/memcontrol.c
> > +++ lockless-update/mm/memcontrol.c
> > @@ -1485,7 +1485,8 @@ void mem_cgroup_update_file_mapped(struc
> >  		SetPageCgroupFileMapped(pc);
> >  	} else {
> >  		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		ClearPageCgroupFileMapped(pc);
> > +		if (page_mapped(page)) /* for race between dec->inc counter */
> > +			ClearPageCgroupFileMapped(pc);
> >  	}
> 
> This should be !page_mapped(), shouldn't it?
> 

Ahhhh, yes. A mistake while refreshing the patch..

> And your second patch _does_ have !page_mapped() here, which is why the
> second patch didn't apply.
> 
Very sorry.

> I tried to fix things up.  Please check.

Thank you. 

-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [BUGFIX][PATCH] memcg: fix race in file_mapped accouting flag management
  2010-09-13 21:08   ` Andrew Morton
@ 2010-09-14  4:38     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-14  4:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, balbir, nishimura, gthelen, linux-kernel, stable


Sorry, reposting.. (my mail client died while sending..)
==
On Mon, 13 Sep 2010 14:08:03 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 13 Sep 2010 16:08:22 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
<snip>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > Index: lockless-update/mm/memcontrol.c
> > ===================================================================
> > --- lockless-update.orig/mm/memcontrol.c
> > +++ lockless-update/mm/memcontrol.c
> > @@ -1485,7 +1485,8 @@ void mem_cgroup_update_file_mapped(struc
> >  		SetPageCgroupFileMapped(pc);
> >  	} else {
> >  		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		ClearPageCgroupFileMapped(pc);
> > +		if (page_mapped(page)) /* for race between dec->inc counter */
> > +			ClearPageCgroupFileMapped(pc);
> >  	}
> 
> This should be !page_mapped(), shouldn't it?
> 

Ahhhh, yes. A mistake while refreshing the patch..

> And your second patch _does_ have !page_mapped() here, which is why the
> second patch didn't apply.
> 
Very sorry.

> I tried to fix things up.  Please check.

Thank you. 

-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: avoid lock in updating file_mapped (Was fix race in file_mapped accouting flag management
  2010-09-13 17:26       ` Balbir Singh
@ 2010-09-14  4:55         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-14  4:55 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, nishimura, gthelen, linux-kernel, akpm

On Mon, 13 Sep 2010 22:56:19 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-09-13 17:01:51]:
> 
> > 
> > Very sorry, subject was wrong..(reposting).
> > 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > At accounting file events per memory cgroup, we need to find memory cgroup
> > via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup() for guarantee
> > pc->mem_cgroup is not overwritten while we make use of it.
> > 
> > But, considering the context which page-cgroup for files are accessed,
> > we can use alternative light-weight mutual execusion in the most case.
> > 
> > At handling file-caches, the only race we have to take care of is "moving"
> > account, IOW, overwriting page_cgroup->mem_cgroup. 
> > (See comment in the patch)
> > 
> > Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> > rmdir() and task-moving (with a special settings.)
> > This patch adds a race-checker for file-cache-status accounting v.s. account
> > moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> > The routine for account move 
> >   1. Increment it before start moving
> >   2. Call synchronize_rcu()
> >   3. Decrement it after the end of moving.
> > By this, file-status-counting routine can check it needs to call
> > lock_page_cgroup(). In most case, I doesn't need to call it.
> > 
> > Following is a perf data of a process which mmap()/munmap 32MB of file cache
> > in a minute.
> > 
> > Before patch:
> >     28.25%     mmap  mmap               [.] main
> >     22.64%     mmap  [kernel.kallsyms]  [k] page_fault
> >      9.96%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
> >      3.67%     mmap  [kernel.kallsyms]  [k] filemap_fault
> >      3.50%     mmap  [kernel.kallsyms]  [k] unmap_vmas
> >      2.99%     mmap  [kernel.kallsyms]  [k] __do_fault
> >      2.76%     mmap  [kernel.kallsyms]  [k] find_get_page
> > 
> > After patch:
> >     30.00%     mmap  mmap               [.] main
> >     23.78%     mmap  [kernel.kallsyms]  [k] page_fault
> >      5.52%     mmap  [kernel.kallsyms]  [k] mem_cgroup_update_file_mapped
> >      3.81%     mmap  [kernel.kallsyms]  [k] unmap_vmas
> >      3.26%     mmap  [kernel.kallsyms]  [k] find_get_page
> >      3.18%     mmap  [kernel.kallsyms]  [k] __do_fault
> >      3.03%     mmap  [kernel.kallsyms]  [k] filemap_fault
> >      2.40%     mmap  [kernel.kallsyms]  [k] handle_mm_fault
> >      2.40%     mmap  [kernel.kallsyms]  [k] do_page_fault
> > 
> > This patch reduces memcg's cost to some extent.
> > (mem_cgroup_update_file_mapped is called by both of map/unmap)
> > 
> > Note: It seems some more improvements are required..but no idea.
> >       maybe removing set/unset flag is required.
> > 
> > Changelog: 20100913
> >  - decoupled with ID patches.
> >  - updated comments.
> > 
> > Changelog: 20100901
> >  - changes id_to_memcg(pc, true) to be id_to_memcg(pc, false)
> >    in update_file_mapped()
> >  - updated comments on lock rule of update_file_mapped()
> > Changelog: 20100825
> >  - added a comment about mc.lock
> >  - fixed bad lock.
> > Changelog: 20100804
> >  - added a comment for possible optimization hint.
> > Changelog: 20100730
> >  - some cleanup.
> > Changelog: 20100729
> >  - replaced __this_cpu_xxx() with this_cpu_xxx
> >    (because we don't call spinlock)
> >  - added VM_BUG_ON().
> > 
> > Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 85 insertions(+), 14 deletions(-)
> > 
> > Index: lockless-update/mm/memcontrol.c
> > ===================================================================
> > --- lockless-update.orig/mm/memcontrol.c
> > +++ lockless-update/mm/memcontrol.c
> > @@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> >  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> >  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
> > 
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> > @@ -1051,7 +1052,46 @@ static unsigned int get_swappiness(struc
> >  	return swappiness;
> >  }
> > 
> > -/* A routine for testing mem is not under move_account */
> > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > +{
> > +	int cpu;
> > +	/* Because this is for moving account, reuse mc.lock */
> > +	spin_lock(&mc.lock);
> > +	for_each_possible_cpu(cpu)
> 
> for_each_possible_cpu() might be too much, no?
> 
> I recommend we use a get_online_cpus()/put_online_cpus() pair
> around the call and optimize.
> 
That makes the patch bigger; I would have to add a CPU hotplug notifier
to fix up the per-cpu counter on CPUs that come online during a move.
If you really want it, I'll write it as an add-on with some clean-ups.

And restricting ourselves to online CPUs requires a hotplug notifier
anyway.  Without one, the whole move would have to run under the
hotplug lock:

	get_online_cpus();
	for_each_online_cpu(cpu)
		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
	synchronize_rcu();
	/* ... the very heavy account moving, which can sleep ... */
	for_each_online_cpu(cpu)
		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
	put_online_cpus();

Blocking CPU hotplug for that long cannot be justified on
mission-critical servers.

Now, this code itself is only called at rmdir() and move_task(), and
both are very slow paths, so touching all possible CPUs is cheap by
comparison.
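
For reference, the hot path this counter protects,
mem_cgroup_update_file_mapped(), only falls back to lock_page_cgroup()
when a move may be in flight.  Roughly like this (a simplified sketch
of the idea, not the exact hunk):

	void mem_cgroup_update_file_mapped(struct page *page, int val)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);
		struct mem_cgroup *mem;
		bool need_lock = false;

		rcu_read_lock();
		mem = pc->mem_cgroup;
		if (!mem || !PageCgroupUsed(pc))
			goto out;
		/* pc->mem_cgroup is only rewritten while ON_MOVE > 0 on this cpu */
		if (unlikely(this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0)) {
			need_lock = true;
			lock_page_cgroup(pc);
			mem = pc->mem_cgroup;	/* re-read under the lock */
			if (!mem || !PageCgroupUsed(pc))
				goto out_unlock;
		}
		if (val > 0) {
			this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
			SetPageCgroupFileMapped(pc);
		} else {
			this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
			if (!page_mapped(page))	/* see the BUGFIX patch above */
				ClearPageCgroupFileMapped(pc);
		}
	out_unlock:
		if (need_lock)
			unlock_page_cgroup(pc);
	out:
		rcu_read_unlock();
	}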

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2010-09-14  5:00 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-13  7:08 [BUGFIX][PATCH] memcg: fix race in file_mapped accouting flag management KAMEZAWA Hiroyuki
2010-09-13  7:08 ` KAMEZAWA Hiroyuki
2010-09-13  7:13 ` [PATCH] " KAMEZAWA Hiroyuki
2010-09-13  7:13   ` KAMEZAWA Hiroyuki
2010-09-13  8:01   ` [PATCH] memcg: avoid lock in updating file_mapped (Was " KAMEZAWA Hiroyuki
2010-09-13  8:01     ` KAMEZAWA Hiroyuki
2010-09-13 17:26     ` Balbir Singh
2010-09-13 17:26       ` Balbir Singh
2010-09-14  4:55       ` KAMEZAWA Hiroyuki
2010-09-14  4:55         ` KAMEZAWA Hiroyuki
2010-09-13  8:47 ` [BUGFIX][PATCH] memcg: " Balbir Singh
2010-09-13  8:47   ` Balbir Singh
2010-09-13 15:28   ` Hiroyuki Kamezawa
2010-09-13 15:28     ` Hiroyuki Kamezawa
2010-09-13 17:17     ` Balbir Singh
2010-09-13 17:17       ` Balbir Singh
2010-09-13 21:08 ` Andrew Morton
2010-09-13 21:08   ` Andrew Morton
2010-09-14  4:35   ` KAMEZAWA Hiroyuki
2010-09-14  4:35     ` KAMEZAWA Hiroyuki
2010-09-14  4:38   ` KAMEZAWA Hiroyuki
2010-09-14  4:38     ` KAMEZAWA Hiroyuki
