* [PATCH V2 0/3] memcg: simplify lock of page stat accounting
@ 2013-05-13  5:03 ` Sha Zhengju
  0 siblings, 0 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-13  5:03 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: mhocko, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

Hi,

This is my second attempt at making the memcg page stat lock simpler; the
first version is here: http://www.spinics.net/lists/linux-mm/msg50037.html.

In this version I investigate the potential race conditions among page
stat, move_account, charge and uncharge, and try to show that the proposed
lock scheme is race safe. The first patch is the basis of the patchset, so
if I've made some stupid mistake please do not hesitate to point it out.

Change log:
v2 <- v1:
   * rewrite comments on race condition
   * split the original large patch into two parts
   * replace the heavyweight try_get_mem_cgroup_from_page() with
     rcu_read_lock() to keep the memcg alive

Sha Zhengju (3):
   memcg: rewrite the comment about race condition of page stat accounting
   memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
   memcg: simplify lock of memcg page stat account	

 include/linux/memcontrol.h |   14 ++++++-------
 mm/memcontrol.c            |   16 ++++++---------
 mm/rmap.c                  |   49 +++++++++++++++++++++++++++++++++-----------
 3 files changed, 50 insertions(+), 29 deletions(-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH V2 1/3] memcg: rewrite the comment about race condition of page stat accounting
  2013-05-13  5:03 ` Sha Zhengju
  (?)
@ 2013-05-13  5:04 ` Sha Zhengju
  -1 siblings, 0 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-13  5:04 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: mhocko, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

While doing memcg page stat accounting we need to access page_cgroup
members, including pc->mem_cgroup and pc->flags, so there are three
candidates that can potentially race with it: move account, charge and
uncharge.

Page stat and uncharge do not actually race, because the former is done
before the page is deleted from the radix-tree and the latter after the
delete, so the two are serialized by other locks (such as the page lock).

So the races among them will be solved by:

         | stat | move      | charge           | uncharge
stat     | X    | move lock | no race          | no race
move     |      | X         | lock_page_cgroup | lock_page_cgroup
charge   |      |           | X                | lock_page_cgroup
uncharge |      |           |                  | X

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..b31513e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2317,9 +2317,10 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
  * are no race with "charge".
  *
  * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * at "uncharge" intentionally but clear PageCgroupUsed flag for that page.
+ * Besides, the file-stat operations happen before a page is deleted from
+ * radix-tree while uncharge is after the delete. So there are no race with
+ * "uncharge" too.
  *
  * Considering "move", this is an only case we see a race. To make the race
  * small, we check mm->moving_account and detect there are possibility of race
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
  2013-05-13  5:03 ` Sha Zhengju
  (?)
  (?)
@ 2013-05-13  5:05 ` Sha Zhengju
  2013-05-13 12:25   ` Michal Hocko
  2013-05-14  0:15     ` Kamezawa Hiroyuki
  -1 siblings, 2 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-13  5:05 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: mhocko, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
'struct page *' to 'struct mem_cgroup *', and so move the PageCgroupUsed(pc)
check out of mem_cgroup_update_page_stat(). This is a preparatory patch for
the memcg page stat lock simplification that follows.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |   14 +++++++-------
 mm/memcontrol.c            |    9 ++-------
 mm/rmap.c                  |   28 +++++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..7af3ed3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -163,20 +163,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 	rcu_read_unlock();
 }
 
-void mem_cgroup_update_page_stat(struct page *page,
+void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
 				 enum mem_cgroup_page_stat_item idx,
 				 int val);
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
+static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_page_stat_item idx)
 {
-	mem_cgroup_update_page_stat(page, idx, 1);
+	mem_cgroup_update_page_stat(memcg, idx, 1);
 }
 
-static inline void mem_cgroup_dec_page_stat(struct page *page,
+static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_page_stat_item idx)
 {
-	mem_cgroup_update_page_stat(page, idx, -1);
+	mem_cgroup_update_page_stat(memcg, idx, -1);
 }
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
@@ -347,12 +347,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 {
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
+static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_page_stat_item idx)
 {
 }
 
-static inline void mem_cgroup_dec_page_stat(struct page *page,
+static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_page_stat_item idx)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b31513e..a394ba4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2367,18 +2367,13 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
 }
 
-void mem_cgroup_update_page_stat(struct page *page,
+void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
 				 enum mem_cgroup_page_stat_item idx, int val)
 {
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-	unsigned long uninitialized_var(flags);
-
 	if (mem_cgroup_disabled())
 		return;
 
-	memcg = pc->mem_cgroup;
-	if (unlikely(!memcg || !PageCgroupUsed(pc)))
+	if (unlikely(!memcg))
 		return;
 
 	switch (idx) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 6280da8..a03c2a9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
 {
 	bool locked;
 	unsigned long flags;
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
 
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+	pc = lookup_page_cgroup(page);
+
+	rcu_read_lock();
+	memcg = pc->mem_cgroup;
+	if (unlikely(!PageCgroupUsed(pc)))
+		memcg = NULL;
+
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		if (memcg)
+			mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
 	}
+	rcu_read_unlock();
+
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
@@ -1129,14 +1141,22 @@ void page_remove_rmap(struct page *page)
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
 
 	/*
 	 * The anon case has no mem_cgroup page_stat to update; but may
 	 * uncharge_page() below, where the lock ordering can deadlock if
 	 * we hold the lock against page_stat move: so avoid it on anon.
 	 */
-	if (!anon)
+	if (!anon) {
 		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+		pc = lookup_page_cgroup(page);
+		rcu_read_lock();
+		memcg = pc->mem_cgroup;
+		if (unlikely(!PageCgroupUsed(pc)))
+			memcg = NULL;
+	}
 
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
@@ -1157,7 +1177,9 @@ void page_remove_rmap(struct page *page)
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		if (memcg)
+			mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
+		rcu_read_unlock();
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
  2013-05-13  5:03 ` Sha Zhengju
                   ` (2 preceding siblings ...)
  (?)
@ 2013-05-13  5:05 ` Sha Zhengju
  2013-05-13 13:12   ` Michal Hocko
  -1 siblings, 1 reply; 36+ messages in thread
From: Sha Zhengju @ 2013-05-13  5:05 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: mhocko, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

After removing duplicated information such as the PCG_* flags from
'struct page_cgroup' (commit 2ff76f1193), there is a problem between
"move" and "page stat accounting" (only FILE_MAPPED is supported now,
but other stats will be added in the future; here I'd like to take
dirty pages as an example):

Assume CPU-A does "page stat accounting" and CPU-B does "move"

CPU-A                        CPU-B
TestSet PG_dirty
(delay)              	move_lock_mem_cgroup()
                        if (PageDirty(page)) {
                             old_memcg->nr_dirty --
                             new_memcg->nr_dirty++
                        }
                        pc->mem_cgroup = new_memcg;
                        move_unlock_mem_cgroup()

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
memcg->nr_dirty++
move_unlock_mem_cgroup()

Here the accounting information of new_memcg may be double-counted. So we
use a bigger lock to solve this problem (commit 89c06bd52f):

      move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
      TestSetPageDirty(page)
      update page stats (without any checks)
      move_unlock_mem_cgroup() <-- mem_cgroup_end_update_page_stat()
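
To make the scheme above concrete, here is a minimal userspace C model of
it (illustrative only, with made-up names; a pthread mutex stands in for
the per-memcg move_lock, this is not the kernel code). Both the flag
test-and-set and the counter update sit inside the lock, so a concurrent
"move" always sees them in a consistent state and nothing is
double-counted:

#include <pthread.h>
#include <stdbool.h>

struct group { long nr_dirty; };	/* stands in for a memcg's dirty counter */

/* one global lock stands in for the per-memcg move_lock */
static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;

struct page_model {
	bool dirty;		/* stands in for PG_dirty */
	struct group *grp;	/* stands in for pc->mem_cgroup */
};

/* CPU-A side: flag update and counter update both inside the lock */
static void model_set_page_dirty(struct page_model *p)
{
	pthread_mutex_lock(&move_lock);		/* mem_cgroup_begin_update_page_stat() */
	if (!p->dirty) {			/* TestSetPageDirty() */
		p->dirty = true;
		p->grp->nr_dirty++;		/* update page stats, no extra checks */
	}
	pthread_mutex_unlock(&move_lock);	/* mem_cgroup_end_update_page_stat() */
}

/* CPU-B side: move the page's accounting from its current group to 'to' */
static void model_move_account(struct page_model *p, struct group *to)
{
	pthread_mutex_lock(&move_lock);
	if (p->dirty) {
		p->grp->nr_dirty--;
		to->nr_dirty++;
	}
	p->grp = to;				/* pc->mem_cgroup = new_memcg */
	pthread_mutex_unlock(&move_lock);
}

int main(void)
{
	struct group oldgrp = { 0 }, newgrp = { 0 };
	struct page_model p = { .dirty = false, .grp = &oldgrp };

	model_set_page_dirty(&p);		/* oldgrp.nr_dirty == 1 */
	model_move_account(&p, &newgrp);	/* oldgrp 0, newgrp 1, never double-counted */
	return 0;
}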


But this method also has its downsides: at present we use two layers of
lock avoidance (memcg_moving and memcg->moving_account) and then a
spinlock on the memcg (see mem_cgroup_begin_update_page_stat()), but the
lock granularity is rather large: not only the critical section but also
some surrounding code logic falls inside the locked region, which may be
deadlock prone. While trying to add memcg dirty page accounting, this
runs into further difficulty with the page cache radix-tree lock, and
even worse, mem_cgroup_begin_update_page_stat() would need to nest
(https://lkml.org/lkml/2013/1/2/48). By the time this patch was being
prepared, the lock nesting problem was no longer possible because s390/mm
has reworked that path (commit abf09bed), but it would still be better
if we could make the lock simpler and recursion safe.

A choice may be:

       CPU-A (stat)                 CPU-B (move)

move_lock_mem_cgroup()
if (PageCgroupUsed(pc)) ---(1)
   needinc = 1;
old_memcg = pc->mem_cgroup
ret = TestSetPageDirty(page)	lock_page_cgroup();
move_unlock_mem_cgroup()     	if (!PageCgroupUsed(pc)) ---(2)
				  return;
                             	move_lock_mem_cgroup()
                             	if (PageDirty) {
                                  old_memcg->nr_dirty --;
                                  new_memcg->nr_dirty ++;
                             	}
                             	pc->mem_cgroup = new_memcg
                             	move_unlock_mem_cgroup()
if (needinc & ret)		unlock_page_cgroup();
   old_memcg->nr_dirty ++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
TestSetPageDirty inside move_lock, and then update the stats if the page
was newly set PG_dirty. CPU-B may do the "moving" first, in which case
its "old_memcg->nr_dirty --" temporarily makes old_memcg->nr_dirty
incorrect, but soon afterwards CPU-A does "old_memcg->nr_dirty ++",
which amends the stats. Now that only the old_memcg snapshot and
TestSetPageDirty are done under move_lock, the possibility of deadlock
or recursion is greatly reduced.

But is it race safe? As we know, there are four candidates among which
races may exist: page stat, move account, charge and uncharge. In the
illustration above, the PageCgroupUsed check can give different results
in steps (1) and (2) because of a charge or uncharge. But we do not need
additional synchronization here, since there is no race between stat &
charge or stat & uncharge. For details please see the comments above
__mem_cgroup_begin_update_page_stat().
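
Reusing struct group, struct page_model, move_lock and
model_move_account() from the userspace sketch earlier in this
description, the CPU-A side of the proposed scheme would look roughly
like this (again purely illustrative, not the kernel code; the
PageCgroupUsed check and CPU-B's lock_page_cgroup are left out):

/* CPU-A: only the old-group snapshot and the flag test-and-set are
 * inside move_lock; the counter update itself happens outside it. */
static void model_account_dirty(struct page_model *p)
{
	struct group *old_grp;
	bool newly_dirty;

	pthread_mutex_lock(&move_lock);
	old_grp = p->grp;		/* old_memcg = pc->mem_cgroup */
	newly_dirty = !p->dirty;	/* ret = TestSetPageDirty(page) */
	p->dirty = true;
	pthread_mutex_unlock(&move_lock);

	/*
	 * If model_move_account() ran in between, old_grp->nr_dirty was
	 * decremented before this increment and is transiently short by
	 * one; the increment below amends it, as described above.
	 */
	if (newly_dirty)
		old_grp->nr_dirty++;
}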


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/rmap.c |   23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index a03c2a9..7b58576 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,7 @@
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/tlbflush.h>
 
@@ -1111,6 +1112,7 @@ void page_add_file_rmap(struct page *page)
 	unsigned long flags;
 	struct page_cgroup *pc;
 	struct mem_cgroup *memcg = NULL;
+	bool ret;
 
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	pc = lookup_page_cgroup(page);
@@ -1119,15 +1121,15 @@ void page_add_file_rmap(struct page *page)
 	memcg = pc->mem_cgroup;
 	if (unlikely(!PageCgroupUsed(pc)))
 		memcg = NULL;
+	ret = atomic_inc_and_test(&page->_mapcount);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 
-	if (atomic_inc_and_test(&page->_mapcount)) {
+	if (ret) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		if (memcg)
 			mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
 	}
 	rcu_read_unlock();
-
-	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
@@ -1143,6 +1145,7 @@ void page_remove_rmap(struct page *page)
 	unsigned long flags;
 	struct page_cgroup *pc;
 	struct mem_cgroup *memcg = NULL;
+	bool ret;
 
 	/*
 	 * The anon case has no mem_cgroup page_stat to update; but may
@@ -1158,16 +1161,20 @@ void page_remove_rmap(struct page *page)
 			memcg = NULL;
 	}
 
+	ret = atomic_add_negative(-1, &page->_mapcount);
+	if (!anon)
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
-		goto out;
+	if (!ret)
+		return;
 
 	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
 	if (unlikely(PageHuge(page)))
-		goto out;
+		return;
 	if (anon) {
 		mem_cgroup_uncharge_page(page);
 		if (!PageTransHuge(page))
@@ -1180,7 +1187,6 @@ void page_remove_rmap(struct page *page)
 		if (memcg)
 			mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
 		rcu_read_unlock();
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1194,9 +1200,6 @@ void page_remove_rmap(struct page *page)
 	 * faster for those pages still in swapcache.
 	 */
 	return;
-out:
-	if (!anon)
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
  2013-05-13  5:05 ` [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer Sha Zhengju
@ 2013-05-13 12:25   ` Michal Hocko
  2013-05-14  9:00     ` Sha Zhengju
  2013-05-14  0:15     ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2013-05-13 12:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

On Mon 13-05-13 13:05:24, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
> 'struct page *' to 'struct mem_cgroup *', and so move PageCgroupUsed(pc)
> checking out of mem_cgroup_update_page_stat(). This is a prepare patch for
> the following memcg page stat lock simplifying.

No, please do not do this because it just spreads memcg specific code
out of memcontrol.c. Besides that the patch is not correct.
[...]
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
>  {
>  	bool locked;
>  	unsigned long flags;
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg = NULL;
>  
>  	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +	pc = lookup_page_cgroup(page);
> +
> +	rcu_read_lock();
> +	memcg = pc->mem_cgroup;

a) unnecessary RCU take for memcg disabled and b) worse KABOOM in that case
as page_cgroup is NULL. We really do not want to put
mem_cgroup_disabled() tests all over the place. The idea behind
mem_cgroup_begin_update_page_stat was to be almost a noop for !memcg
(and the real noop for !CONFIG_MEMCG).

so Nak to this approach
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
  2013-05-13  5:05 ` [PATCH V2 3/3] memcg: simplify lock of memcg page stat account Sha Zhengju
@ 2013-05-13 13:12   ` Michal Hocko
  2013-05-13 13:38       ` Michal Hocko
  2013-05-14  8:35     ` Sha Zhengju
  0 siblings, 2 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-13 13:12 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

On Mon 13-05-13 13:05:44, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> After removing duplicated information like PCG_* flags in
> 'struct page_cgroup'(commit 2ff76f1193), there's a problem between
> "move" and "page stat accounting"(only FILE_MAPPED is supported now
> but other stats will be added in future, and here I'd like to take
> dirty page as an example):
> 
> Assume CPU-A does "page stat accounting" and CPU-B does "move"
> 
> CPU-A                        CPU-B
> TestSet PG_dirty
> (delay)              	move_lock_mem_cgroup()
>                         if (PageDirty(page)) {
>                              old_memcg->nr_dirty --
>                              new_memcg->nr_dirty++
>                         }
>                         pc->mem_cgroup = new_memcg;
>                         move_unlock_mem_cgroup()
> 
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> memcg->nr_dirty++
> move_unlock_mem_cgroup()
> 
> while accounting information of new_memcg may be double-counted. So we
> use a bigger lock to solve this problem:  (commit: 89c06bd52f)
> 
>       move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>       TestSetPageDirty(page)
>       update page stats (without any checks)
>       move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
> 
> 
> But this method also has its pros and cons: at present we use two layers
> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock
> granularity is a little bigger that not only the critical section but
> also some code logic is in the range of locking which may be deadlock
> prone. While trying to add memcg dirty page accounting, it gets into
> further difficulty with page cache radix-tree lock and even worse
> mem_cgroup_begin_update_page_stat() requires nesting
> (https://lkml.org/lkml/2013/1/2/48). However, when the current patch is
> preparing, the lock nesting problem is longer possible as s390/mm has
> reworked it out(commit:abf09bed), but it should be better
> if we can make the lock simpler and recursive safe.

This patch doesn't make the charge move locking recursive safe. It
just tries to overcome the problem in the path where it doesn't exist
anymore. mem_cgroup_begin_update_page_stat would still deadlock if it
was re-entered.

It makes PageCgroupUsed usage even more tricky because it uses it out of
lock_page_cgroup context. It seems that this would work in this
particular path because atomic_inc_and_test(_mapcount) will protect from
double accounting but the whole dance around old_memcg seems pointless
to me.

I am sorry but I do not think this is the right approach. IMO we should
focus on mem_cgroup_begin_update_page_stat and make it really recursive
safe - ideally without any additional overhead (which sounds like a real
challenge)

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
@ 2013-05-13 13:38       ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-13 13:38 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd, gthelen, Sha Zhengju

On Mon 13-05-13 15:12:51, Michal Hocko wrote:
[...]
> I am sorry but I do not think this is the right approach. IMO we should
> focus on mem_cgroup_begin_update_page_stat and make it really recursive
> safe - ideally without any additional overhead (which sounds like a real
> challenge)

Or maybe we should just not over-complicate this and simply consider
recursiveness when it starts being an issue. It is not a problem for
rmap accounting anymore, dirty page accounting seems to be safe as
well, and pages-under-writeback accounting was OK even previously.
It doesn't make much sense to block dirty page accounting because of a
non-existing problem.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
@ 2013-05-14  0:15     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 36+ messages in thread
From: Kamezawa Hiroyuki @ 2013-05-14  0:15 UTC (permalink / raw)
  To: Sha Zhengju; +Cc: cgroups, linux-mm, mhocko, akpm, hughd, gthelen, Sha Zhengju

(2013/05/13 14:05), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
> 'struct page *' to 'struct mem_cgroup *', and so move PageCgroupUsed(pc)
> checking out of mem_cgroup_update_page_stat(). This is a prepare patch for
> the following memcg page stat lock simplifying.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Why do we need to find the page_cgroup and the memcg and check the
pc->flags bit even when memcg is unused? I think it's too slow.

Sorry, NAK.



> ---
>   include/linux/memcontrol.h |   14 +++++++-------
>   mm/memcontrol.c            |    9 ++-------
>   mm/rmap.c                  |   28 +++++++++++++++++++++++++---
>   3 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..7af3ed3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -163,20 +163,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>   	rcu_read_unlock();
>   }
>   
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>   				 enum mem_cgroup_page_stat_item idx,
>   				 int val);
>   
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_page_stat_item idx)
>   {
> -	mem_cgroup_update_page_stat(page, idx, 1);
> +	mem_cgroup_update_page_stat(memcg, idx, 1);
>   }
>   
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_page_stat_item idx)
>   {
> -	mem_cgroup_update_page_stat(page, idx, -1);
> +	mem_cgroup_update_page_stat(memcg, idx, -1);
>   }
>   
>   unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> @@ -347,12 +347,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>   {
>   }
>   
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_page_stat_item idx)
>   {
>   }
>   
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_page_stat_item idx)
>   {
>   }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b31513e..a394ba4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2367,18 +2367,13 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>   	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
>   }
>   
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>   				 enum mem_cgroup_page_stat_item idx, int val)
>   {
> -	struct mem_cgroup *memcg;
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	unsigned long uninitialized_var(flags);
> -
>   	if (mem_cgroup_disabled())
>   		return;
>   
> -	memcg = pc->mem_cgroup;
> -	if (unlikely(!memcg || !PageCgroupUsed(pc)))
> +	if (unlikely(!memcg))
>   		return;
>   
>   	switch (idx) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6280da8..a03c2a9 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
>   {
>   	bool locked;
>   	unsigned long flags;
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg = NULL;
>   
>   	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +	pc = lookup_page_cgroup(page);
> +
> +	rcu_read_lock();
> +	memcg = pc->mem_cgroup;
> +	if (unlikely(!PageCgroupUsed(pc)))
> +		memcg = NULL;
> +
>   	if (atomic_inc_and_test(&page->_mapcount)) {
>   		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		if (memcg)
> +			mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
>   	}
> +	rcu_read_unlock();
> +
>   	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>   }
>   
> @@ -1129,14 +1141,22 @@ void page_remove_rmap(struct page *page)
>   	bool anon = PageAnon(page);
>   	bool locked;
>   	unsigned long flags;
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg = NULL;
>   
>   	/*
>   	 * The anon case has no mem_cgroup page_stat to update; but may
>   	 * uncharge_page() below, where the lock ordering can deadlock if
>   	 * we hold the lock against page_stat move: so avoid it on anon.
>   	 */
> -	if (!anon)
> +	if (!anon) {
>   		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +		pc = lookup_page_cgroup(page);
> +		rcu_read_lock();
> +		memcg = pc->mem_cgroup;
> +		if (unlikely(!PageCgroupUsed(pc)))
> +			memcg = NULL;
> +	}
>   
>   	/* page still mapped by someone else? */
>   	if (!atomic_add_negative(-1, &page->_mapcount))
> @@ -1157,7 +1177,9 @@ void page_remove_rmap(struct page *page)
>   					      NR_ANON_TRANSPARENT_HUGEPAGES);
>   	} else {
>   		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		if (memcg)
> +			mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
> +		rcu_read_unlock();
>   		mem_cgroup_end_update_page_stat(page, &locked, &flags);
>   	}
>   	if (unlikely(PageMlocked(page)))
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simplify lock of page stat accounting
@ 2013-05-14  0:41   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 36+ messages in thread
From: Kamezawa Hiroyuki @ 2013-05-14  0:41 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hughd-hpIqsD4AKlfQT0dZR+AlfA, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	Sha Zhengju

If you want to rewrite everything and make memcg cleaner, I won't stop you.
But how about starting with this simple one for your 1st purpose?
Doesn't it work? Is it too dirty?

== this patch is untested. ==
 
From 95e405451f56933c4777e64bb02326ec0462f7a7 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Date: Tue, 14 May 2013 09:40:55 +0900
Subject: [PATCH] Allow nesting lock of memcg's page stat accounting.

Sha Zhengju and Michal Hocko pointed out that
mem_cgroup_begin/end_update_page_stat() should be nested lock.
https://lkml.org/lkml/2013/1/2/48

page_remove_rmap
  mem_cgroup_begin_update_page_stat		<<< 1
    set_page_dirty
      __set_page_dirty_buffers
        __set_page_dirty
          mem_cgroup_begin_update_page_stat	<<< 2
            move_lock_mem_cgroup
              spin_lock_irqsave(&memcg->move_lock, *flags);

This patch adds nesting functionality with a per-thread counter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
---
 include/linux/sched.h |    1 +
 mm/memcontrol.c       |   22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84ceef5..cca3229 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1402,6 +1402,7 @@ struct task_struct {
 		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 	} memcg_batch;
 	unsigned int memcg_kmem_skip_account;
+	unsigned int memcg_page_stat_accounting;
 #endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 357371a..152f8df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2352,12 +2352,30 @@ again:
 	 */
 	if (!mem_cgroup_stolen(memcg))
 		return;
+	/*
+	 * In some case, we need nested lock of this.
+	 * page_remove_rmap
+	 *   mem_cgroup_begin_update_page_stat		<<< 1
+	 *     set_page_dirty
+	 *       __set_page_dirty_buffers
+	 *         __set_page_dirty
+	 *           mem_cgroup_begin_update_page_stat	<<< 2
+	 *             move_lock_mem_cgroup
+	 *               spin_lock_irqsave(&memcg->move_lock, *flags);
+	 *
+	 * We avoid this deadlock by having per thread counter.
+	 */
+	if (current->memcg_page_stat_accounting > 0) {
+		current->memcg_page_stat_accounting++;
+		return;
+	}
 
 	move_lock_mem_cgroup(memcg, flags);
 	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
 		move_unlock_mem_cgroup(memcg, flags);
 		goto again;
 	}
+	current->memcg_page_stat_accounting = 1;
 	*locked = true;
 }
 
@@ -2370,7 +2388,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 	 * lock is held because a routine modifies pc->mem_cgroup
 	 * should take move_lock_mem_cgroup().
 	 */
-	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+	current->memcg_page_stat_accounting--;
+	if (!current->memcg_page_stat_accounting)
+		move_unlock_mem_cgroup(pc->mem_cgroup, flags);
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-- 
1.7.4.1
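
To illustrate the per-thread nesting counter used in the patch above,
here is a minimal userspace C model (made-up names, gcc/clang __thread;
not the kernel code): the outermost begin takes the lock and sets the
depth to 1, nested begins only bump the depth, and only the matching
outermost end drops the lock:

#include <pthread.h>

static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;

/* models current->memcg_page_stat_accounting: one counter per thread */
static __thread unsigned int stat_nesting;

static void begin_update_page_stat(void)
{
	if (stat_nesting > 0) {		/* already inside a section on this thread */
		stat_nesting++;		/* just record the nesting, no re-lock */
		return;
	}
	pthread_mutex_lock(&move_lock);
	stat_nesting = 1;
}

static void end_update_page_stat(void)
{
	if (--stat_nesting == 0)	/* only the outermost end really unlocks */
		pthread_mutex_unlock(&move_lock);
}

int main(void)
{
	begin_update_page_stat();	/* outer: takes the lock */
	begin_update_page_stat();	/* nested: would deadlock without the counter */
	end_update_page_stat();
	end_update_page_stat();		/* lock released here */
	return 0;
}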



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simplify lock of page stat accounting
@ 2013-05-14  7:13     ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-14  7:13 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, cgroups, linux-mm, akpm, hughd, gthelen, Sha Zhengju

On Tue 14-05-13 09:41:42, KAMEZAWA Hiroyuki wrote:
> If you want to rewrite all things and make memcg cleaner, I don't stop it.
> But, how about starting with this simeple one for your 1st purpose ? 
> doesn't work ? dirty ?
> 
> == this patch is untested. ==

And it is unnecessary because the trace is no longer possible:
set_page_dirty is no longer called from page_remove_rmap, see commit
abf09bed3cceadd809f0356065c2ada6cee90d4a.

> From 95e405451f56933c4777e64bb02326ec0462f7a7 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Tue, 14 May 2013 09:40:55 +0900
> Subject: [PATCH] Allow nesting lock of memcg's page stat accouting.
> 
> Sha Zhengju and Michal Hocko pointed out that
> mem_cgroup_begin/end_update_page_stat() should be nested lock.
> https://lkml.org/lkml/2013/1/2/48
> 
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat		<<< 1
>     set_page_dirty
>       __set_page_dirty_buffers
>         __set_page_dirty
>           mem_cgroup_begin_update_page_stat	<<< 2
>             move_lock_mem_cgroup
>               spin_lock_irqsave(&memcg->move_lock, *flags);
> 
> This patch add a nesting functionality with per-thread counter.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/sched.h |    1 +
>  mm/memcontrol.c       |   22 +++++++++++++++++++++-
>  2 files changed, 22 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84ceef5..cca3229 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1402,6 +1402,7 @@ struct task_struct {
>  		unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
>  	} memcg_batch;
>  	unsigned int memcg_kmem_skip_account;
> +	unsigned int memcg_page_stat_accounting;
>  #endif
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
>  	atomic_t ptrace_bp_refcnt;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 357371a..152f8df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2352,12 +2352,30 @@ again:
>  	 */
>  	if (!mem_cgroup_stolen(memcg))
>  		return;
> +	/*
> +	 * In some case, we need nested lock of this.
> +	 * page_remove_rmap
> +	 *   mem_cgroup_begin_update_page_stat		<<< 1
> +	 *     set_page_dirty
> +	 *       __set_page_dirty_buffers
> +	 *         __set_page_dirty
> +	 *           mem_cgroup_begin_update_page_stat	<<< 2
> +	 *             move_lock_mem_cgroup
> +	 *               spin_lock_irqsave(&memcg->move_lock, *flags);
> +	 *
> +	 * We avoid this deadlock by having per thread counter.
> +	 */
> +	if (current->memcg_page_stat_accounting > 0) {
> +		current->memcg_page_stat_accounting++;
> +		return;
> +	}
>  
>  	move_lock_mem_cgroup(memcg, flags);
>  	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
>  		move_unlock_mem_cgroup(memcg, flags);
>  		goto again;
>  	}
> +	current->memcg_page_stat_accounting = 1;
>  	*locked = true;
>  }
>  
> @@ -2370,7 +2388,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>  	 * lock is held because a routine modifies pc->mem_cgroup
>  	 * should take move_lock_mem_cgroup().
>  	 */
> -	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +	current->memcg_page_stat_accounting--;
> +	if (!current->memcg_page_stat_accounting)
> +		move_unlock_mem_cgroup(pc->mem_cgroup, flags);
>  }
>  
>  void mem_cgroup_update_page_stat(struct page *page,
> -- 
> 1.7.4.1
> 
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
  2013-05-13 13:12   ` Michal Hocko
  2013-05-13 13:38       ` Michal Hocko
@ 2013-05-14  8:35     ` Sha Zhengju
  1 sibling, 0 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-14  8:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Cgroups, linux-mm, KAMEZAWA Hiroyuki, Andrew Morton,
	Hugh Dickins, Greg Thelen, Sha Zhengju

Hi Michal,

Thank you for taking time out of your busy schedule to review the patch!

On Mon, May 13, 2013 at 9:12 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 13-05-13 13:05:44, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> After removing duplicated information like PCG_* flags in
>> 'struct page_cgroup'(commit 2ff76f1193), there's a problem between
>> "move" and "page stat accounting"(only FILE_MAPPED is supported now
>> but other stats will be added in future, and here I'd like to take
>> dirty page as an example):
>>
>> Assume CPU-A does "page stat accounting" and CPU-B does "move"
>>
>> CPU-A                        CPU-B
>> TestSet PG_dirty
>> (delay)               move_lock_mem_cgroup()
>>                         if (PageDirty(page)) {
>>                              old_memcg->nr_dirty --
>>                              new_memcg->nr_dirty++
>>                         }
>>                         pc->mem_cgroup = new_memcg;
>>                         move_unlock_mem_cgroup()
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> memcg->nr_dirty++
>> move_unlock_mem_cgroup()
>>
>> while accounting information of new_memcg may be double-counted. So we
>> use a bigger lock to solve this problem:  (commit: 89c06bd52f)
>>
>>       move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>>       TestSetPageDirty(page)
>>       update page stats (without any checks)
>>       move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>>
>>
>> But this method also has its pros and cons: at present we use two layers
>> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
>> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock
>> granularity is a little bigger that not only the critical section but
>> also some code logic is in the range of locking which may be deadlock
>> prone. While trying to add memcg dirty page accounting, it gets into
>> further difficulty with page cache radix-tree lock and even worse
>> mem_cgroup_begin_update_page_stat() requires nesting
>> (https://lkml.org/lkml/2013/1/2/48). However, when the current patch is
>> preparing, the lock nesting problem is longer possible as s390/mm has
>> reworked it out(commit:abf09bed), but it should be better
>> if we can make the lock simpler and recursive safe.
>
> This patch doesn't make the charge move locking recursive safe. It
> just tries to overcome the problem in the path where it doesn't exist
> anymore. mem_cgroup_begin_update_page_stat would still deadlock if it
> was re-entered.

Regarding deadlock or recursion, I think one of the reasons is that the
scope of the lock is too large and includes some complicated code. So this
patch tries to make the lock regions as small as possible to lower the
possibility of recursion. Yes, mem_cgroup_begin_update_page_stat() still
can't be re-entered after this patch, but if we can avoid the re-entrant
calls in the first place, that also solves our problem, doesn't it?

>
> It makes PageCgroupUsed usage even more tricky because it uses it out of
> lock_page_cgroup context. It seems that this would work in this

This is why I investigated all those four candidates to find out whether
using PageCgroupUsed here is race safe... it is really a little tricky, to
be honest...

> particular path because atomic_inc_and_test(_mapcount) will protect from
> double accounting but the whole dance around old_memcg seems pointless
> to me.

There's no problem with FILE_MAPPED accounting, it will be serialized by
page table lock.

>
> I am sorry but I do not think this is the right approach. IMO we should
> focus on mem_cgroup_begin_update_page_stat and make it really recursive
> safe - ideally without any additional overhead (which sounds like a real
> challenge)
>
> [...]
> --
> Michal Hocko
> SUSE Labs



Thanks,
Sha

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
  2013-05-13 12:25   ` Michal Hocko
@ 2013-05-14  9:00     ` Sha Zhengju
  2013-05-14  9:10       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Sha Zhengju @ 2013-05-14  9:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Cgroups, linux-mm, KAMEZAWA Hiroyuki, Andrew Morton,
	Hugh Dickins, Greg Thelen, Sha Zhengju

On Mon, May 13, 2013 at 8:25 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 13-05-13 13:05:24, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
>> 'struct page *' to 'struct mem_cgroup *', and so move PageCgroupUsed(pc)
>> checking out of mem_cgroup_update_page_stat(). This is a prepare patch for
>> the following memcg page stat lock simplifying.
>
> No, please do not do this because it just spreads memcg specific code
> out of memcontrol.c. Besides that the patch is not correct.
> [...]
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
>>  {
>>       bool locked;
>>       unsigned long flags;
>> +     struct page_cgroup *pc;
>> +     struct mem_cgroup *memcg = NULL;
>>
>>       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +     pc = lookup_page_cgroup(page);
>> +
>> +     rcu_read_lock();
>> +     memcg = pc->mem_cgroup;
>
> a) unnecessary RCU take for memcg disabled and b) worse KABOOM in that case
> as page_cgroup is NULL. We really do not want to put
> mem_cgroup_disabled() tests all over the place. The idea behind
> mem_cgroup_begin_update_page_stat was to be almost a noop for !memcg
> (and the real noop for !CONFIG_MEMCG).

That is indeed unwise behavior. How about wrapping this part in
mm/memcontrol.c or memcontrol.h as well?

>
> so Nak to this approach



--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
@ 2013-05-14  9:03       ` Sha Zhengju
  0 siblings, 0 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-14  9:03 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Cgroups, linux-mm, Michal Hocko, Andrew Morton, Hugh Dickins,
	Greg Thelen, Sha Zhengju

Hi Kame,

On Tue, May 14, 2013 at 8:15 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/05/13 14:05), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
>> 'struct page *' to 'struct mem_cgroup *', and so move PageCgroupUsed(pc)
>> checking out of mem_cgroup_update_page_stat(). This is a prepare patch for
>> the following memcg page stat lock simplifying.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> Why do we need to find the page_cgroup and memcg and check the pc->flags
> bit even if memcg is unused? I think it's too slow.
>
> Sorry, NAK.

Yeah, it's my fault... my brain filtered out the !CONFIG_MEMCG case
automatically. Sorry for the stupid mistake, and thank you for reviewing!


Thanks,
Sha

>
>
>
>> ---
>>   include/linux/memcontrol.h |   14 +++++++-------
>>   mm/memcontrol.c            |    9 ++-------
>>   mm/rmap.c                  |   28 +++++++++++++++++++++++++---
>>   3 files changed, 34 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index d6183f0..7af3ed3 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -163,20 +163,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>>       rcu_read_unlock();
>>   }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>>                                enum mem_cgroup_page_stat_item idx,
>>                                int val);
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>>                                           enum mem_cgroup_page_stat_item idx)
>>   {
>> -     mem_cgroup_update_page_stat(page, idx, 1);
>> +     mem_cgroup_update_page_stat(memcg, idx, 1);
>>   }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>>                                           enum mem_cgroup_page_stat_item idx)
>>   {
>> -     mem_cgroup_update_page_stat(page, idx, -1);
>> +     mem_cgroup_update_page_stat(memcg, idx, -1);
>>   }
>>
>>   unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> @@ -347,12 +347,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>>   {
>>   }
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>>                                           enum mem_cgroup_page_stat_item idx)
>>   {
>>   }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>>                                           enum mem_cgroup_page_stat_item idx)
>>   {
>>   }
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b31513e..a394ba4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2367,18 +2367,13 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>>       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
>>   }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>>                                enum mem_cgroup_page_stat_item idx, int val)
>>   {
>> -     struct mem_cgroup *memcg;
>> -     struct page_cgroup *pc = lookup_page_cgroup(page);
>> -     unsigned long uninitialized_var(flags);
>> -
>>       if (mem_cgroup_disabled())
>>               return;
>>
>> -     memcg = pc->mem_cgroup;
>> -     if (unlikely(!memcg || !PageCgroupUsed(pc)))
>> +     if (unlikely(!memcg))
>>               return;
>>
>>       switch (idx) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 6280da8..a03c2a9 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
>>   {
>>       bool locked;
>>       unsigned long flags;
>> +     struct page_cgroup *pc;
>> +     struct mem_cgroup *memcg = NULL;
>>
>>       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +     pc = lookup_page_cgroup(page);
>> +
>> +     rcu_read_lock();
>> +     memcg = pc->mem_cgroup;
>> +     if (unlikely(!PageCgroupUsed(pc)))
>> +             memcg = NULL;
>> +
>>       if (atomic_inc_and_test(&page->_mapcount)) {
>>               __inc_zone_page_state(page, NR_FILE_MAPPED);
>> -             mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> +             if (memcg)
>> +                     mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
>>       }
>> +     rcu_read_unlock();
>> +
>>       mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>   }
>>
>> @@ -1129,14 +1141,22 @@ void page_remove_rmap(struct page *page)
>>       bool anon = PageAnon(page);
>>       bool locked;
>>       unsigned long flags;
>> +     struct page_cgroup *pc;
>> +     struct mem_cgroup *memcg = NULL;
>>
>>       /*
>>        * The anon case has no mem_cgroup page_stat to update; but may
>>        * uncharge_page() below, where the lock ordering can deadlock if
>>        * we hold the lock against page_stat move: so avoid it on anon.
>>        */
>> -     if (!anon)
>> +     if (!anon) {
>>               mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +             pc = lookup_page_cgroup(page);
>> +             rcu_read_lock();
>> +             memcg = pc->mem_cgroup;
>> +             if (unlikely(!PageCgroupUsed(pc)))
>> +                     memcg = NULL;
>> +     }
>>
>>       /* page still mapped by someone else? */
>>       if (!atomic_add_negative(-1, &page->_mapcount))
>> @@ -1157,7 +1177,9 @@ void page_remove_rmap(struct page *page)
>>                                             NR_ANON_TRANSPARENT_HUGEPAGES);
>>       } else {
>>               __dec_zone_page_state(page, NR_FILE_MAPPED);
>> -             mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> +             if (memcg)
>> +                     mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
>> +             rcu_read_unlock();
>>               mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       }
>>       if (unlikely(PageMlocked(page)))
>>
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
  2013-05-14  9:00     ` Sha Zhengju
@ 2013-05-14  9:10       ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-14  9:10 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Cgroups, linux-mm, KAMEZAWA Hiroyuki, Andrew Morton,
	Hugh Dickins, Greg Thelen, Sha Zhengju

On Tue 14-05-13 17:00:08, Sha Zhengju wrote:
> On Mon, May 13, 2013 at 8:25 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 13-05-13 13:05:24, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Change the first argument of mem_cgroup_{update,inc,dec}_page_stat() from
> >> 'struct page *' to 'struct mem_cgroup *', and so move PageCgroupUsed(pc)
> >> checking out of mem_cgroup_update_page_stat(). This is a prepare patch for
> >> the following memcg page stat lock simplifying.
> >
> > No, please do not do this because it just spreads memcg specific code
> > out of memcontrol.c. Besides that the patch is not correct.
> > [...]
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1109,12 +1109,24 @@ void page_add_file_rmap(struct page *page)
> >>  {
> >>       bool locked;
> >>       unsigned long flags;
> >> +     struct page_cgroup *pc;
> >> +     struct mem_cgroup *memcg = NULL;
> >>
> >>       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> >> +     pc = lookup_page_cgroup(page);
> >> +
> >> +     rcu_read_lock();
> >> +     memcg = pc->mem_cgroup;
> >
> > a) unnecessary RCU take for memcg disabled and b) worse KABOOM in that case
> > as page_cgroup is NULL. We really do not want to put
> > mem_cgroup_disabled() tests all over the place. The idea behind
> > mem_cgroup_begin_update_page_stat was to be almost a noop for !memcg
> > (and the real noop for !CONFIG_MEMCG).
> 
> It's indeed an unwise behavior. How about also wrapping it in
> mm/memcontrol.c or memcontrol.h?

I just think it doesn't make much sense to overcomplicate this -
especially when this is not an issue anymore.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
@ 2013-05-14  9:13         ` Sha Zhengju
  0 siblings, 0 replies; 36+ messages in thread
From: Sha Zhengju @ 2013-05-14  9:13 UTC (permalink / raw)
  To: Michal Hocko, KAMEZAWA Hiroyuki
  Cc: Cgroups, linux-mm, Andrew Morton, Hugh Dickins, Greg Thelen, Sha Zhengju

On Mon, May 13, 2013 at 9:38 PM, Michal Hocko <mhocko@suse.cz> wrote:
>
> On Mon 13-05-13 15:12:51, Michal Hocko wrote:
> [...]
> > I am sorry but I do not think this is the right approach. IMO we should
> > focus on mem_cgroup_begin_update_page_stat and make it really recursive
> > safe - ideally without any additional overhead (which sounds like a real
> > challenge)
>
> Or maybe we should just not over complicate this and simply consider
> recursivness when it starts being an issue. It is not a problem for
> rmap accounting anymore and dirty pages accounting seems to be safe as
> well and pages under writeback accounting was OK even previously.
> It doesn't make much sense to block dirty pages accounting by a
> non-existing problem.
>

Yes, the dirty/writeback accounting seems okay now. I sent this patch
out to see whether I could do something to simplify the locks, but this
approach seems to have its own drawbacks. Since you and Kame have NAKed
it, in order of importance I'll put the patch aside and continue the
work on dirty page accounting. :)

Thanks for the teaching!


--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 3/3] memcg: simplify lock of memcg page stat account
@ 2013-05-14  9:28           ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-14  9:28 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: KAMEZAWA Hiroyuki, Cgroups, linux-mm, Andrew Morton,
	Hugh Dickins, Greg Thelen, Sha Zhengju

On Tue 14-05-13 17:13:07, Sha Zhengju wrote:
> On Mon, May 13, 2013 at 9:38 PM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > On Mon 13-05-13 15:12:51, Michal Hocko wrote:
> > [...]
> > > I am sorry but I do not think this is the right approach. IMO we should
> > > focus on mem_cgroup_begin_update_page_stat and make it really recursive
> > > safe - ideally without any additional overhead (which sounds like a real
> > > challenge)
> >
> > Or maybe we should just not over complicate this and simply consider
> > recursivness when it starts being an issue. It is not a problem for
> > rmap accounting anymore and dirty pages accounting seems to be safe as
> > well and pages under writeback accounting was OK even previously.
> > It doesn't make much sense to block dirty pages accounting by a
> > non-existing problem.
> >
> 
> Yes, the dirty/writeback accounting seems okay now. I sent this patch
> out to see whether I could do something to simplify the locks, but this
> approach seems to have its own drawbacks. Since you and Kame have NAKed
> it, in order of importance I'll put the patch aside and continue the
> work on dirty page accounting. :)

Thanks for your work!

> Thanks for the teaching!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
  2013-05-13  5:03 ` Sha Zhengju
                   ` (4 preceding siblings ...)
  (?)
@ 2013-05-15 12:35 ` Konstantin Khlebnikov
  2013-05-15 13:41     ` Michal Hocko
  -1 siblings, 1 reply; 36+ messages in thread
From: Konstantin Khlebnikov @ 2013-05-15 12:35 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-mm, mhocko, kamezawa.hiroyu, akpm, hughd, gthelen,
	Sha Zhengju

Sha Zhengju wrote:
> Hi,
>
> This is my second attempt to make memcg page stat lock simpler, the
> first version: http://www.spinics.net/lists/linux-mm/msg50037.html.
>
> In this version I investigate the potential race conditions among
> page stat, move_account, charge, uncharge and try to prove it race
> safe of my proposing lock scheme. The first patch is the basis of
> the patchset, so if I've made some stupid mistake please do not
> hesitate to point it out.

I have a provocational question. Who needs these numbers? I mean per-cgroup
nr_mapped and so on. It's too hard to maintain them carefully, and I don't
know of any clear use for them. I have written several implementations of
this stuff for the OpenVZ kernel, but in the end I decided to just remove
it. Does anybody know of really useful use cases for these nr_mapped
counters?


In our kernel we have per-container nr_dirty and nr_writeback counters. But
they are implemented on top of radix-tree tags, and their owners are stored
on the inode/mapping, so this is a completely different story.
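
Just to illustrate the data source, a sketch (not our actual code, which
keeps the counters incrementally; it only relies on PAGECACHE_TAG_DIRTY
being kept up to date by set_page_dirty() and clear_page_dirty_for_io(),
as mainline already does):

/*
 * Illustration only: count one mapping's dirty-tagged slots.  A real
 * walk would also handle radix-tree retry entries and races with
 * concurrent tag changes; this is just to show where the data lives.
 */
static unsigned long count_dirty_pages(struct address_space *mapping)
{
	struct radix_tree_iter iter;
	void **slot;
	unsigned long nr = 0;

	rcu_read_lock();
	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
				   PAGECACHE_TAG_DIRTY)
		nr++;
	rcu_read_unlock();

	return nr;
}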

I have surely missed some earlier discussions about these questions (or
maybe there were none?). Either way, I hope this is a good time to bring
them up again.

>
> Change log:
> v2<- v1:
>     * rewrite comments on race condition
>     * split orignal large patch to two parts
>     * change too heavy try_get_mem_cgroup_from_page() to rcu_read_lock
>       to hold memcg alive
>
> Sha Zhengju (3):
>     memcg: rewrite the comment about race condition of page stat accounting
>     memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer
>     memcg: simplify lock of memcg page stat account	
>
>   include/linux/memcontrol.h |   14 ++++++-------
>   mm/memcontrol.c            |   16 ++++++---------
>   mm/rmap.c                  |   49 +++++++++++++++++++++++++++++++++-----------
>   3 files changed, 50 insertions(+), 29 deletions(-)
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
@ 2013-05-15 13:41     ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-15 13:41 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

On Wed 15-05-13 16:35:08, Konstantin Khlebnikov wrote:
> Sha Zhengju wrote:
> >Hi,
> >
> >This is my second attempt to make memcg page stat lock simpler, the
> >first version: http://www.spinics.net/lists/linux-mm/msg50037.html.
> >
> >In this version I investigate the potential race conditions among
> >page stat, move_account, charge, uncharge and try to prove it race
> >safe of my proposing lock scheme. The first patch is the basis of
> >the patchset, so if I've made some stupid mistake please do not
> >hesitate to point it out.
> 
> I have a provocational question. Who needs these numbers? I mean
> per-cgroup nr_mapped and so on.

Well, I guess it makes some sense to know how much page cache and anon
memory is charged to the group. I am using that to monitor the per-group
memory usage. I can imagine an even better coverage - something
/proc/meminfo like.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
  2013-05-15 13:41     ` Michal Hocko
  (?)
@ 2013-05-16  4:28     ` Konstantin Khlebnikov
  2013-05-16 13:28         ` Michal Hocko
  -1 siblings, 1 reply; 36+ messages in thread
From: Konstantin Khlebnikov @ 2013-05-16  4:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

Michal Hocko wrote:
> On Wed 15-05-13 16:35:08, Konstantin Khlebnikov wrote:
>> Sha Zhengju wrote:
>>> Hi,
>>>
>>> This is my second attempt to make memcg page stat lock simpler, the
>>> first version: http://www.spinics.net/lists/linux-mm/msg50037.html.
>>>
>>> In this version I investigate the potential race conditions among
>>> page stat, move_account, charge, uncharge and try to prove it race
>>> safe of my proposing lock scheme. The first patch is the basis of
>>> the patchset, so if I've made some stupid mistake please do not
>>> hesitate to point it out.
>>
>> I have a provocational question. Who needs these numbers? I mean
>> per-cgroup nr_mapped and so on.
>
> Well, I guess it makes some sense to know how much page cache and anon
> memory is charged to the group. I am using that to monitor the per-group
>> memory usage. I can imagine an even better coverage - something
> /proc/meminfo like.
>

I think page counters from lru-vectors can give enough information for that.

If somebody needs more detailed information there are enough ways to get it.
Amount of mapped pages can be estimated via summing rss counters from mm-structs.
Exact numbers can be obtained via examining /proc/pid/pagemap.
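
For example, the mm-counter estimate is just something like this sketch
(the helper name is made up; with split rss accounting the per-mm counters
can lag a little behind the truth, which is why /proc/pid/pagemap is the
exact source):

/* Sketch: estimate one task's mapped file pages from its mm counters. */
static unsigned long task_mapped_file_pages(struct task_struct *task)
{
	struct mm_struct *mm = task->mm;

	if (!mm)		/* kernel thread */
		return 0;

	return get_mm_counter(mm, MM_FILEPAGES);
}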

I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
for adding such weird stuff into the rmap code on map/unmap paths.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
@ 2013-05-16 13:28         ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-16 13:28 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

On Thu 16-05-13 08:28:33, Konstantin Khlebnikov wrote:
> Michal Hocko wrote:
> >On Wed 15-05-13 16:35:08, Konstantin Khlebnikov wrote:
> >>Sha Zhengju wrote:
> >>>Hi,
> >>>
> >>>This is my second attempt to make memcg page stat lock simpler, the
> >>>first version: http://www.spinics.net/lists/linux-mm/msg50037.html.
> >>>
> >>>In this version I investigate the potential race conditions among
> >>>page stat, move_account, charge, uncharge and try to prove it race
> >>>safe of my proposing lock scheme. The first patch is the basis of
> >>>the patchset, so if I've made some stupid mistake please do not
> >>>hesitate to point it out.
> >>
> >>I have a provocational question. Who needs these numbers? I mean
> >>per-cgroup nr_mapped and so on.
> >
> >Well, I guess it makes some sense to know how much page cache and anon
> >memory is charged to the group. I am using that to monitor the per-group
> >memory usage. I can imagine an even better coverage - something
> >/proc/meminfo like.
> >
> 
> I think page counters from lru-vectors can give enough information for that.

not for dirty and writeback data which is the next step.

> If somebody needs more detailed information there are enough ways to get it.
> Amount of mapped pages can be estimated via summing rss counters from mm-structs.
> Exact numbers can be obtained via examining /proc/pid/pagemap.

How do you find out whether given pages were charged to the group of
interest - e.g. shared data or taks that has moved from a different
group without move_at_immigrate?

> I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
> for adding such weird stuff into the rmap code on map/unmap paths.

The accounting code is trying to be not intrusive as much as possible.
This patchset makes it more complicated without a good reason and that
is why it has been Nacked by me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
  2013-05-16 13:28         ` Michal Hocko
  (?)
@ 2013-05-17  5:57         ` Konstantin Khlebnikov
  2013-05-17  8:38           ` Michal Hocko
  -1 siblings, 1 reply; 36+ messages in thread
From: Konstantin Khlebnikov @ 2013-05-17  5:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

Michal Hocko wrote:
> On Thu 16-05-13 08:28:33, Konstantin Khlebnikov wrote:
>> Michal Hocko wrote:
>>> On Wed 15-05-13 16:35:08, Konstantin Khlebnikov wrote:
>>>> Sha Zhengju wrote:
>>>>> Hi,
>>>>>
>>>>> This is my second attempt to make memcg page stat lock simpler, the
>>>>> first version: http://www.spinics.net/lists/linux-mm/msg50037.html.
>>>>>
>>>>> In this version I investigate the potential race conditions among
>>>>> page stat, move_account, charge, uncharge and try to prove it race
>>>>> safe of my proposing lock scheme. The first patch is the basis of
>>>>> the patchset, so if I've made some stupid mistake please do not
>>>>> hesitate to point it out.
>>>>
>>>> I have a provocational question. Who needs these numbers? I mean
>>>> per-cgroup nr_mapped and so on.
>>>
>>> Well, I guess it makes some sense to know how much page cache and anon
>>> memory is charged to the group. I am using that to monitor the per-group
>>> memory usage. I can imagine an even better coverage - something
>>> /proc/meminfo like.
>>>
>>
>> I think page counters from lru-vectors can give enough information for that.
>
> not for dirty and writeback data which is the next step.

I think tracking dirty and writeback pages in a per-inode manner is much more
useful. If there is only one cgroup per inode that is responsible for all of its
dirtied pages, we can use this as a hint during writeback to account disk
operations and throttle tasks in that cgroup.

This approach makes it easy to implement an effective IO bandwidth controller
in the VFS layer. Actually we did this in our commercial product; the feature,
called 'iolimits', works exactly this way. Unlike blkcg, this disk bandwidth
controller doesn't suffer from priority inversion bugs related to the fs
journal, and it works for non-disk filesystems like NFS and FUSE. It is
something like balance_dirty_pages() on steroids which can also handle read
operations and can take IOPS counters into account.

>
>> If somebody needs more detailed information there are enough ways to get it.
>> Amount of mapped pages can be estimated via summing rss counters from mm-structs.
>> Exact numbers can be obtained via examining /proc/pid/pagemap.
>
> How do you find out whether given pages were charged to the group of
> interest - e.g. shared data or taks that has moved from a different
> group without move_at_immigrate?

For example we can export pages ownership and charging state via single file in proc,
something similar to /proc/kpageflags

BTW
In our kernel the memory controller tries to change page's ownership at first mmap and
at each page activation, probably it's worth to add this into mainline memcg too.

>
>> I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
>> for adding such weird stuff into the rmap code on map/unmap paths.
>
> The accounting code is trying to be not intrusive as much as possible.
> This patchset makes it more complicated without a good reason and that
> is why it has been Nacked by me.

I think we can remove it or replace it with something different but much less intrusive,
if nobody strictly requires exactly this approach in managing 'mapped' pages counters.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
  2013-05-17  5:57         ` Konstantin Khlebnikov
@ 2013-05-17  8:38           ` Michal Hocko
  2013-05-17 10:29               ` Konstantin Khlebnikov
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2013-05-17  8:38 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

On Fri 17-05-13 09:57:37, Konstantin Khlebnikov wrote:
> Michal Hocko wrote:
> >On Thu 16-05-13 08:28:33, Konstantin Khlebnikov wrote:
[...]
> >>If somebody needs more detailed information there are enough ways to get it.
> >>Amount of mapped pages can be estimated via summing rss counters from mm-structs.
> >>Exact numbers can be obtained via examining /proc/pid/pagemap.
> >
> >How do you find out whether given pages were charged to the group of
> >interest - e.g. shared data or taks that has moved from a different
> >group without move_at_immigrate?
> 
> For example we can export pages ownership and charging state via
> single file in proc, something similar to /proc/kpageflags

So you would like to add a new interface with cryptic api (I consider
kpageflags to be a devel tool not an admin aid) to replace something
that is easy to use? Doesn't make much sense to me.

> BTW
> In our kernel the memory controller tries to change page's ownership
> at first mmap and at each page activation, probably it's worth to add
> this into mainline memcg too.

Dunno, there are different approaches for this. I haven't evalueted them
so I don't know all the pros and cons. Why not just unmap&uncharge the
page when the charging process dies. This should be more lightweight
wrt. recharge on re-activation.
 
> >>I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
> >>for adding such weird stuff into the rmap code on map/unmap paths.
> >
> >The accounting code is trying to be not intrusive as much as possible.
> >This patchset makes it more complicated without a good reason and that
> >is why it has been Nacked by me.
> 
> I think we can remove it or replace it with something different but
> much less intrusive, if nobody strictly requires exactly this approach
> in managing 'mapped' pages counters.

Do you have any numbers on the intrusiveness? I do not mind to change
the internal implementation but the file is a part of the user space API
so we cannot get rid of it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
@ 2013-05-17 10:29               ` Konstantin Khlebnikov
  0 siblings, 0 replies; 36+ messages in thread
From: Konstantin Khlebnikov @ 2013-05-17 10:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

Michal Hocko wrote:
> On Fri 17-05-13 09:57:37, Konstantin Khlebnikov wrote:
>> Michal Hocko wrote:
>>> On Thu 16-05-13 08:28:33, Konstantin Khlebnikov wrote:
> [...]
>>>> If somebody needs more detailed information there are enough ways to get it.
>>>> Amount of mapped pages can be estimated via summing rss counters from mm-structs.
>>>> Exact numbers can be obtained via examining /proc/pid/pagemap.
>>>
>>> How do you find out whether given pages were charged to the group of
>>> interest - e.g. shared data or taks that has moved from a different
>>> group without move_at_immigrate?
>>
>> For example we can export pages ownership and charging state via
>> single file in proc, something similar to /proc/kpageflags
>
> So you would like to add a new interface with cryptic api (I consider
> kpageflags to be a devel tool not an admin aid) to replace something
> that is easy to use? Doesn't make much sense to me.

Hmm. Kernel api is just api. It's not supposed to be a tool itself.
It must be usable for making tools which admins can use.

>
>> BTW
>> In our kernel the memory controller tries to change page's ownership
>> at first mmap and at each page activation, probably it's worth to add
>> this into mainline memcg too.
>
> Dunno, there are different approaches for this. I haven't evalueted them
> so I don't know all the pros and cons. Why not just unmap&uncharge the
> page when the charging process dies. This should be more lightweight
> wrt. recharge on re-activation.

On activation it's mostly free, because we need to lock lru anyway.
If memcg charge/uncharge isn't fast enough for that it should be optimized.


So, you propose to uncharge pages at unmap and charge them to who?
To the next mm in rmap?

What if next map will happens from different memcg, but after that unmap:

A:map
<pages owned by A>
A:unmap
<no new owner in rmap>
B:map
<pages still owned by A>

Anyway this is difficult question and I bring this just to show that currently
used logic isn't perfect in many ways. Switching ownership lazily in particular
places is a best solution which I can see now. So, page will be owned by its last
active user.

>
>>>> I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
>>>> for adding such weird stuff into the rmap code on map/unmap paths.
>>>
>>> The accounting code is trying to be not intrusive as much as possible.
>>> This patchset makes it more complicated without a good reason and that
>>> is why it has been Nacked by me.
>>
>> I think we can remove it or replace it with something different but
>> much less intrusive, if nobody strictly requires exactly this approach
>> in managing 'mapped' pages counters.
>
> Do you have any numbers on the intrusiveness? I do not mind to change
> the internal implementation but the file is a part of the user space API
> so we cannot get rid of it.

Main problem here that it increases complexity of switching pages' ownership.
We need not just charge/uncharge page and move it to the different LRU, but
also synchronize with map/unmap paths in rmap to update nr-mapped counters
carefully. If you want to add per-page dirty/writeback counters it will
increase that complexity even further.

I haven't started yet migrating our product to mainline memory controller,
so I don't have any numbers. But my old code completely lockless on charging
and uncharging fast paths. LRU lock is the only lock on changing page's owner,
LRU isolation itself protects page ownership reference.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH V2 0/3] memcg: simply lock of page stat accounting
@ 2013-05-17 12:53                 ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2013-05-17 12:53 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Sha Zhengju, cgroups, linux-mm, kamezawa.hiroyu, akpm, hughd,
	gthelen, Sha Zhengju

On Fri 17-05-13 14:29:33, Konstantin Khlebnikov wrote:
> Michal Hocko wrote:
> >On Fri 17-05-13 09:57:37, Konstantin Khlebnikov wrote:
> >>Michal Hocko wrote:
> >>>On Thu 16-05-13 08:28:33, Konstantin Khlebnikov wrote:
> >[...]
> >>>>If somebody needs more detailed information there are enough ways to get it.
> >>>>Amount of mapped pages can be estimated via summing rss counters from mm-structs.
> >>>>Exact numbers can be obtained via examining /proc/pid/pagemap.
> >>>
> >>>How do you find out whether given pages were charged to the group of
> >>>interest - e.g. shared data or taks that has moved from a different
> >>>group without move_at_immigrate?
> >>
> >>For example we can export pages ownership and charging state via
> >>single file in proc, something similar to /proc/kpageflags
> >
> >So you would like to add a new interface with cryptic api (I consider
> >kpageflags to be a devel tool not an admin aid) to replace something
> >that is easy to use? Doesn't make much sense to me.
> 
> Hmm. Kernel api is just api. It's not supposed to be a tool itself.
> It must be usable for making tools which admins can use.

But the file is there and we have to keep it forever. That was my
point. I do not care how we do the accounting.
 
> >>BTW
> >>In our kernel the memory controller tries to change page's ownership
> >>at first mmap and at each page activation, probably it's worth to add
> >>this into mainline memcg too.
> >
> >Dunno, there are different approaches for this. I haven't evalueted them
> >so I don't know all the pros and cons. Why not just unmap&uncharge the
> >page when the charging process dies. This should be more lightweight
> >wrt. recharge on re-activation.
> 
> On activation it's mostly free, because we need to lock lru anyway.
> If memcg charge/uncharge isn't fast enough for that it should be optimized.

Be careful about that. A charge can end up in the OOM path, waiting for the
user to free some memory. You cannot call charge from within any
locks (well, mmap_sem held for reading is somewhat acceptable because that
one is really hard to get rid of).
 
> So, you propose to uncharge pages at unmap and charge them to who?
> To the next mm in rmap?

No, unmap it from all processes if there is no other process from the
same group as the exiting task, and charge again on the re-fault.
I have heard about some use cases where this could help to age shared
data, but the reporter disappeared so I've lost interest in fixing it,
because it wouldn't be cheap and I do not remember many details about
the workload, which would be necessary to justify such a change.

> What if next map will happens from different memcg, but after that unmap:
> 
> A:map
> <pages owned by A>
> A:unmap
> <no new owner in rmap>
> B:map
> <pages still owned by A>
> 
> Anyway this is difficult question and I bring this just to show
> that currently used logic isn't perfect in many ways.

Well, the world is not ideal. I wasn't involved in the early discussion
about this topic but what I remember from mailing list archives is that
"charge at the first touch" was the most reasonable compromise at the
time.

> Switching ownership lazily in particular places is a best solution
> which I can see now. So, page will be owned by its last active user.

There are other options as well. We can have a background daemon which
unmaps shared pages if there are no processes left in the group they have
been charged against. This wouldn't touch any hot paths. But this is
getting off-topic.
 
> >>>>I don't think that simulating 'Mapped' line in /proc/mapfile is a worth reason
> >>>>for adding such weird stuff into the rmap code on map/unmap paths.
> >>>
> >>>The accounting code is trying to be not intrusive as much as possible.
> >>>This patchset makes it more complicated without a good reason and that
> >>>is why it has been Nacked by me.
> >>
> >>I think we can remove it or replace it with something different but
> >>much less intrusive, if nobody strictly requires exactly this approach
> >>in managing 'mapped' pages counters.
> >
> >Do you have any numbers on the intrusiveness? I do not mind to change
> >the internal implementation but the file is a part of the user space API
> >so we cannot get rid of it.
> 
> Main problem here that it increases complexity of switching pages'
> ownership.  We need not just charge/uncharge page and move it to
> the different LRU, but also synchronize with map/unmap paths in
> rmap to update nr-mapped counters carefully. If you want to add
> per-page dirty/writeback counters it will increase that complexity
> even further.

The patches didn't add too much complexity when they were sent the last
time. Well, there was a big problem back then because s390 called
set_page_dirty from the rmap code, which caused a lot of trouble, but
that one is gone, so our free ride can start.

> I haven't started yet migrating our product to mainline memory
> controller, so I don't have any numbers. But my old code completely
> lockless on charging and uncharging fast paths. LRU lock is the only
> lock on changing page's owner, LRU isolation itself protects page
> ownership reference.

Send those patches and we can discuss that further.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread


end of thread, other threads:[~2013-05-17 12:53 UTC | newest]

Thread overview: 36+ messages
2013-05-13  5:03 [PATCH V2 0/3] memcg: simply lock of page stat accounting Sha Zhengju
2013-05-13  5:03 ` Sha Zhengju
2013-05-13  5:04 ` [PATCH V2 1/3] memcg: rewrite the comment about race condition " Sha Zhengju
2013-05-13  5:05 ` [PATCH V2 2/3] memcg: alter mem_cgroup_{update,inc,dec}_page_stat() args to memcg pointer Sha Zhengju
2013-05-13 12:25   ` Michal Hocko
2013-05-14  9:00     ` Sha Zhengju
2013-05-14  9:10       ` Michal Hocko
2013-05-14  0:15   ` Kamezawa Hiroyuki
2013-05-14  0:15     ` Kamezawa Hiroyuki
2013-05-14  9:03     ` Sha Zhengju
2013-05-14  9:03       ` Sha Zhengju
2013-05-13  5:05 ` [PATCH V2 3/3] memcg: simplify lock of memcg page stat account Sha Zhengju
2013-05-13 13:12   ` Michal Hocko
2013-05-13 13:38     ` Michal Hocko
2013-05-13 13:38       ` Michal Hocko
2013-05-14  9:13       ` Sha Zhengju
2013-05-14  9:13         ` Sha Zhengju
2013-05-14  9:28         ` Michal Hocko
2013-05-14  9:28           ` Michal Hocko
2013-05-14  8:35     ` Sha Zhengju
2013-05-14  0:41 ` [PATCH V2 0/3] memcg: simply lock of page stat accounting Kamezawa Hiroyuki
2013-05-14  0:41   ` Kamezawa Hiroyuki
2013-05-14  7:13   ` Michal Hocko
2013-05-14  7:13     ` Michal Hocko
2013-05-15 12:35 ` Konstantin Khlebnikov
2013-05-15 13:41   ` Michal Hocko
2013-05-15 13:41     ` Michal Hocko
2013-05-16  4:28     ` Konstantin Khlebnikov
2013-05-16 13:28       ` Michal Hocko
2013-05-16 13:28         ` Michal Hocko
2013-05-17  5:57         ` Konstantin Khlebnikov
2013-05-17  8:38           ` Michal Hocko
2013-05-17 10:29             ` Konstantin Khlebnikov
2013-05-17 10:29               ` Konstantin Khlebnikov
2013-05-17 12:53               ` Michal Hocko
2013-05-17 12:53                 ` Michal Hocko
