* [PATCH V3 0/8] Per-cgroup page stat accounting
@ 2012-12-25 17:18 ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:18 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

Hi, list

This is the V3 patch series, which gives each memory cgroup independent
dirty/writeback page statistics; these can provide information for per-cgroup
direct reclaim and other uses.

In the first three preparatory patches, we do some cleanup and rework the vfs
set page dirty routines so that "modify page info" and "dirty page accounting"
stay in one function as much as possible, for the sake of the bigger memcg lock
(test numbers are in the specific patch). There is no change compared to V2.

Patches 4/8 and 5/8 actually do the memcg dirty and writeback page accounting. We
change the lock order of mapping->tree_lock and memcg->move_lock to prevent
deadlock. Test numbers from the previous version show some performance decrease
from the accounting once memcg is enabled. The reason is that if no memcg exists
other than root_mem_cgroup, all allocated pages belong to the root memcg and go
through the root memcg statistics routines, which adds overhead. So we do some
optimization in patches 6/8 and 7/8: patch 6 gives up accounting root memcg stats
and changes the behavior of memcg_stat_show() instead; patch 7 uses a jump label to
disable the memcg page stat accounting code when it is not in use, which is inspired
by a similar optimization from Glauber Costa
(memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
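
The jump-label idea in patch 7/8 is, roughly, to keep the accounting call behind a
branch that is patched out while no memcg is actually in use. A minimal sketch only;
the key name, the helper, and where the key gets flipped are illustrative, not the
series' actual symbols:

        #include <linux/jump_label.h>
        #include <linux/memcontrol.h>

        /* assumed to be enabled via static_key_slow_inc() once a non-root memcg is created */
        static struct static_key memcg_page_stat_key = STATIC_KEY_INIT_FALSE;

        static inline void memcg_page_stat_update(struct page *page,
                                                  enum mem_cgroup_stat_index idx,
                                                  int val)
        {
                /* compiles to a patched-out branch until the key is enabled */
                if (static_key_false(&memcg_page_stat_key))
                        mem_cgroup_update_page_stat(page, idx, val);
        }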

On a machine with 4GB memory and a 4-core i5 CPU, we push 1G of data through a 600M
memcg (memory.limit_in_bytes=600M, memory.memsw.limit_in_bytes=1500M) with fio:
fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)

The following is a performance comparison before/after the whole series
(each test run 10 times, averages reported):
Before:
write: io=2048.0MB, bw=214527KB/s, iops=53631.2 , runt= 9880.1msec
lat (usec): min=1 , max=1685.06K, avg=36.182, stdev=3153.97

After:
write: io=2048.0MB, bw=193069KB/s, iops=48266.6 , runt= 11078.6msec
lat (usec): min=1 , max=1634.26K, avg=40.598, stdev=3135.81

Note that now the impact is small (~1%).

Any comments are welcome. :)

Change log:
v3 <-- v2:
	1. change the lock order of mapping->tree_lock and memcg->move_lock
	2. performance optimizations in 6/8 and 7/8
v2 <-- v1:
	1. add test numbers
	2. some small fixes and comments

Sha Zhengju (8):
	memcg-remove-MEMCG_NR_FILE_MAPPED.patch
	Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
	use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
	memcg-add-per-cgroup-dirty-pages-accounting.patch
	memcg-add-per-cgroup-writeback-pages-accounting.patch
	memcg-Don-t-account-root_mem_cgroup-page-statistics.patch
	memcg-disable-memcg-page-stat-accounting-code-when-n.patch
	memcg-Document-cgroup-dirty-writeback-memory-statist.patch

 Documentation/cgroups/memory.txt |    2 +
 fs/buffer.c                      |   37 +++++++----
 fs/ceph/addr.c                   |   20 +-----
 include/linux/buffer_head.h      |    2 +
 include/linux/memcontrol.h       |   39 ++++++++---
 mm/filemap.c                     |   10 +++
 mm/memcontrol.c                  |  134 ++++++++++++++++++++++++++++++--------
 mm/page-writeback.c              |   56 ++++++++++++++--
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    6 ++
 10 files changed, 235 insertions(+), 75 deletions(-)


* [PATCH V3 1/8] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-12-25 17:18 ` Sha Zhengju
@ 2012-12-25 17:20   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:20 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When accounting memcg page stats, it is not worth using MEMCG_NR_FILE_MAPPED as an
extra layer of indirection, given the complexity and presumed performance overhead.
We can use MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |   28 ++++++++++++++++++++--------
 mm/memcontrol.c            |   22 +++-------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0108a56..5421b8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,9 +30,21 @@ struct page;
 struct mm_struct;
 struct kmem_cache;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * Statistics for memory cgroup.
+ *
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
+ * These two lists should keep in accord with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -165,17 +177,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -355,12 +367,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ea8951..d450c04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -86,17 +86,9 @@ static int really_do_swap_account __initdata = 0;
 
 /*
  * Statistics for memory cgroup.
+ * The corresponding mem_cgroup_stat_index is defined in include/linux/
+ * memcontrol.h. These two lists should keep in accord with each other.
  */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
 
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
@@ -2226,7 +2218,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -2239,14 +2231,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 59b0dca..c7de54e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1116,7 +1116,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1184,7 +1184,7 @@ void page_remove_rmap(struct page *page)
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5


* [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
@ 2012-12-25 17:22   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:22 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel
  Cc: dchinner, mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
extracted TestSetPageDirty from __set_page_dirty, leaving it far away from
account_page_dirtied. But it is better to keep the two operations in one single
function for modularity. So, in order to avoid the potential race mentioned in
commit a8e7d49a, we hold private_lock until __set_page_dirty completes. We have
confirmed that there is no deadlock between ->private_lock and ->tree_lock.
This is a preparatory patch for the following memcg dirty page accounting patches.


Here are some test numbers from before/after this patch:
Test steps (4GB memory, ext4):
drop_cache; sync
fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)

We run the test 10 times and take the average numbers:
Before:
write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
After:
write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32

Note that the impact is small (<1%).


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 fs/buffer.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c017a2d..3b032b9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	if (TestSetPageDirty(page))
+		return 0;
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	return 1;
 }
 
 /*
@@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	newly_dirty = !TestSetPageDirty(page);
+	newly_dirty = __set_page_dirty(page, mapping, 1);
 	spin_unlock(&mapping->private_lock);
 
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh)) {
-		struct page *page = bh->b_page;
-		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
-			if (mapping)
-				__set_page_dirty(page, mapping, 0);
-		}
-	}
+	if (!test_set_buffer_dirty(bh))
+		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
-- 
1.7.9.5


* [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-12-25 17:18 ` Sha Zhengju
@ 2012-12-25 17:24   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:24 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel
  Cc: sage, dchinner, mhocko, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

In the following patches we will treat SetPageDirty and dirty page accounting as one
integrated operation. Filesystems should use the vfs interface directly instead of
open-coding those details.
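
For example, after this change a filesystem's set_page_dirty() method can delegate
the test-and-set plus dirty accounting to the exported helper. An illustrative
sketch only, not a hunk from this patch (the ceph conversion below is the real one):

        static int example_set_page_dirty(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                /* __set_page_dirty() returns 0 if the page was already dirty */
                if (!__set_page_dirty(page, mapping, 1))
                        return 0;

                /* ... filesystem-private bookkeeping for a newly dirtied page ... */
                return 1;
        }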

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Sage Weil <sage@inktank.com>
---
 fs/buffer.c                 |    3 ++-
 fs/ceph/addr.c              |   20 ++------------------
 include/linux/buffer_head.h |    2 ++
 3 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b032b9..762168a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -609,7 +609,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
 	if (unlikely(!mapping))
@@ -630,6 +630,7 @@ static int __set_page_dirty(struct page *page,
 
 	return 1;
 }
+EXPORT_SYMBOL(__set_page_dirty);
 
 /*
  * Add a page to the dirty page list.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6690269..f2779b8 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>	/* generic_writepages */
+#include <linux/buffer_head.h>
 #include <linux/slab.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
 	int undo = 0;
 	struct ceph_snap_context *snapc;
 
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page)) {
-		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
-		     mapping->host, page, page->index);
+	if (!__set_page_dirty(page, mapping, 1))
 		return 0;
-	}
 
 	inode = mapping->host;
 	ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
 }
 
 extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+		struct address_space *mapping, int warn);
 
 #else /* CONFIG_BLOCK */
 
-- 
1.7.9.5


* [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2012-12-25 17:18 ` Sha Zhengju
@ 2012-12-25 17:26   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	dchinner, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory and can
provide some information for users while a cgroup's direct reclaim is working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we
can use the 'struct page' flags to test page state instead of per-page_cgroup flags.
But memcg has a feature that moves a page from one cgroup to another, which may race
with "page stat accounting". So, in order to avoid the race, we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()
It requires that (a) and (b) (the dirty page accounting) stay close enough together.
In the previous two preparatory patches, we reworked the vfs set page dirty routines
so that the interfaces are now more explicit:
        incrementing (2):
                __set_page_dirty
                __set_page_dirty_nobuffers
        decrementing (2):
                clear_page_dirty_for_io
                cancel_dirty_page
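
For illustration, each of these call sites brackets the page modification and the
stat update with the begin/end helpers. The following is a simplified fragment only
(the real hunks are in the diff below), modeled on the
clear_page_dirty_for_io/cancel_dirty_page case:

        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (TestClearPageDirty(page)) {         /* (a) modify page information */
                /* (b) dirty page accounting, kept under the same lock */
                mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
                dec_zone_page_state(page, NR_FILE_DIRTY);
        }
        mem_cgroup_end_update_page_stat(page, &locked, &flags);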

To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order to:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to take mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
a preliminary PageDirty() check is added.


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/buffer.c                |   14 +++++++++++++-
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   10 ++++++++++
 mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
 mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
 mm/truncate.c              |    6 ++++++
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 762168a..53402d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
+	if (PageDirty(page))
 		return 0;
 
 	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5421b8a..2685d8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..b589be5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *      ->mapping->tree_lock
+ *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *							move_lock_mem_cgroup)
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -112,6 +117,8 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool locked;
+	unsigned long flags;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
 	 * Fix it up by doing a final dirty accounting check after
 	 * having removed the page entirely.
 	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d450c04..c884640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..526ddd7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1978,11 +1978,17 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2022,14 +2028,22 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	if (PageDirty(page))
+		return 0;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
-		if (!mapping)
-			return 1;
-
-		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -2038,13 +2052,18 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		spin_unlock_irq(&mapping->tree_lock);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	spin_unlock_irq(&mapping->tree_lock);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2160,6 +2179,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2201,13 +2223,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index db1b216..c81e6c4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2012-12-25 17:26   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	dchinner, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows memory controller
to maintain an accurate view of the amount of its dirty memory and can provide some
info for users while cgroup's direct reclaim is working.

After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
has a feature to move a page from a cgroup to another one and may have race between
"move" and "page stat accounting". So in order to avoid the race we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()
It requires (a) and (b)(dirty pages accounting) can stay close enough.
In the previous two prepare patches, we have reworked the vfs set page dirty routines
and now the interfaces are more explicit:
        incrementing (2):
                __set_page_dirty
                __set_page_dirty_nobuffers
        decrementing (2):
                clear_page_dirty_for_io
                cancel_dirty_page

To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
a prepare PageDirty() checking is added.


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/buffer.c                |   14 +++++++++++++-
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   10 ++++++++++
 mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
 mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
 mm/truncate.c              |    6 ++++++
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 762168a..53402d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
+	if (PageDirty(page))
 		return 0;
 
 	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5421b8a..2685d8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..b589be5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *      ->mapping->tree_lock
+ *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *							move_lock_mem_cgroup)
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -112,6 +117,8 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool locked;
+	unsigned long flags;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
 	 * Fix it up by doing a final dirty accounting check after
 	 * having removed the page entirely.
 	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d450c04..c884640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..526ddd7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1978,11 +1978,17 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2022,14 +2028,22 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	if (PageDirty(page))
+		return 0;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
-		if (!mapping)
-			return 1;
-
-		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -2038,13 +2052,18 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		spin_unlock_irq(&mapping->tree_lock);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	spin_unlock_irq(&mapping->tree_lock);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2160,6 +2179,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2201,13 +2223,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index db1b216..c81e6c4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
  2012-12-25 17:18 ` Sha Zhengju
@ 2012-12-25 17:26   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty pages, we add per-cgroup writeback page accounting. The locking
rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_clear/set_page_writeback.
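
Roughly, the rule maps onto a caller shaped like the sketch below (illustrative
only, not code taken from the patch; the real changes are in the diff that
follows):

	bool locked;
	unsigned long flags;

	/* enter the memcg page-stat critical section (rcu/move_lock inside) */
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);

	if (!TestSetPageWriteback(page)) {
		/* the flag really changed, so account the page to its memcg */
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
		inc_zone_page_state(page, NR_WRITEBACK);
	}

	mem_cgroup_end_update_page_stat(page, &locked, &flags);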

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   17 +++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 358019e..1d22b81 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21df36d..13cd14a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"mapped_file",
 	"swap",
 	"dirty",
+	"writeback",
 };
 
 enum mem_cgroup_events_index {
@@ -3676,6 +3677,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 			MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 526ddd7..ae6498a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2002,11 +2002,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting writeback pages.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2242,7 +2248,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2263,9 +2272,12 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2273,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2300,6 +2315,8 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
@ 2012-12-25 17:26   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty pages, we add per-cgroup writeback page accounting. The locking
rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_clear/set_page_writeback.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   17 +++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 358019e..1d22b81 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21df36d..13cd14a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,6 +96,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"mapped_file",
 	"swap",
 	"dirty",
+	"writeback",
 };
 
 enum mem_cgroup_events_index {
@@ -3676,6 +3677,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 			MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 526ddd7..ae6498a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2002,11 +2002,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting writeback pages.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2242,7 +2248,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2263,9 +2272,12 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2273,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2300,6 +2315,8 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2012-12-25 17:18 ` Sha Zhengju
@ 2012-12-25 17:27   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:27 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

If memcg is enabled and no non-root memcg exists, all allocated pages
belong to root_mem_cgroup and go through the root memcg statistics routines,
which brings some overhead. So for the sake of performance, we give up
accounting stats of the root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
/WRITEBACK and instead pay special attention when showing root memcg
numbers in memcg_stat_show(): as we no longer account root memcg stats,
the root_mem_cgroup->stat numbers are actually 0. But because of the
hierarchy, the root_mem_cgroup figures should represent the number of pages
used by its own tasks (not belonging to any child cgroup). So here we
reconstruct these root numbers from the global state and the other memcgs.
That is, for the root memcg:
	nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
                              sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
Dirty/writeback page accounting works in a similar way.
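
For example, with made-up numbers: if global_page_state(NR_FILE_MAPPED)
reports 1000 pages while the non-root memcgs together account for 700 of
them, the root memcg "mapped_file" line shows (1000 - 700) * PAGE_SIZE
bytes; if the per-memcg sum ever exceeds the global counter (e.g. due to
per-cpu counter drift), the shown value is clamped to 0.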

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc20ac9..728349d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2230,6 +2230,16 @@ void mem_cgroup_update_page_stat(struct page *page,
 		return;
 
 	memcg = pc->mem_cgroup;
+
+	/*
+	 * For the sake of performance, we don't account stats of root memcg
+	 * for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY/WRITEBACK.
+	 * So we need to pay special attention while showing root memcg numbers.
+	 * See memcg_stat_show().
+	 */
+	if (memcg == root_mem_cgroup)
+		return;
+
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
@@ -5396,18 +5406,70 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 }
 
+long long root_memcg_local_stat(unsigned int i, long long val,
+					long long nstat[])
+{
+	long long res = 0;
+
+	switch (i) {
+	case MEM_CGROUP_STAT_FILE_MAPPED:
+		res = global_page_state(NR_FILE_MAPPED);
+		break;
+	case MEM_CGROUP_STAT_FILE_DIRTY:
+		res = global_page_state(NR_FILE_DIRTY);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK:
+		res = global_page_state(NR_WRITEBACK);
+		break;
+	default:
+		break;
+	}
+
+	res = (res <= val) ? 0 : (res - val) * PAGE_SIZE;
+	nstat[i] = res;
+
+	return res;
+}
+
 static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 				 struct seq_file *m)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	struct mem_cgroup *mi;
 	unsigned int i;
+	long long nstat[MEM_CGROUP_STAT_NSTATS] = {0};
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+		long long val = 0, res = 0;
+
 		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
 			continue;
-		seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
-			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
+		if (i == MEM_CGROUP_STAT_SWAP || i == MEM_CGROUP_STAT_CACHE ||
+			i == MEM_CGROUP_STAT_RSS) {
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+				   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
+			continue;
+		}
+
+		/* As we don't account root memcg stats anymore, the
+		 * root_mem_cgroup->stat numbers are actually 0. But because of
+		 * hierachy, figures of root_mem_cgroup may just represent
+		 * numbers of pages used by its own tasks(not belonging to any
+		 * other child cgroup). So here we fake these root numbers by
+		 * using stats of global state and all other memcg. That is for
+		 * root memcg:
+		 * nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_
+		 * 	MAPPED) - sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED)
+		 * Dirty/Writeback pages accounting are in the similar way.
+		 */
+		if (memcg == root_mem_cgroup) {
+			for_each_mem_cgroup(mi)
+				val += mem_cgroup_read_stat(mi, i);
+			res = root_memcg_local_stat(i, val, nstat);
+		} else
+			res = mem_cgroup_read_stat(memcg, i) * PAGE_SIZE;
+
+		seq_printf(m, "%s %lld\n", mem_cgroup_stat_names[i], res);
 	}
 
 	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
@@ -5435,6 +5497,10 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 			continue;
 		for_each_mem_cgroup_tree(mi, memcg)
 			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
+
+		/* Adding local stats of root memcg */
+		if (memcg == root_mem_cgroup)
+			val += nstat[i];
 		seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
 	}
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
@ 2012-12-25 17:27   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:27 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

If memcg is enabled and no non-root memcg exists, all allocated pages
belong to root_mem_cgroup and go through the root memcg statistics routines,
which brings some overhead. So for the sake of performance, we give up
accounting stats of the root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
/WRITEBACK and instead pay special attention when showing root memcg
numbers in memcg_stat_show(): as we no longer account root memcg stats,
the root_mem_cgroup->stat numbers are actually 0. But because of the
hierarchy, the root_mem_cgroup figures should represent the number of pages
used by its own tasks (not belonging to any child cgroup). So here we
reconstruct these root numbers from the global state and the other memcgs.
That is, for the root memcg:
	nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
                              sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
Dirty/writeback page accounting works in a similar way.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc20ac9..728349d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2230,6 +2230,16 @@ void mem_cgroup_update_page_stat(struct page *page,
 		return;
 
 	memcg = pc->mem_cgroup;
+
+	/*
+	 * For the sake of performance, we don't account stats of root memcg
+	 * for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY/WRITEBACK.
+	 * So we need to pay special attention while showing root memcg numbers.
+	 * See memcg_stat_show().
+	 */
+	if (memcg == root_mem_cgroup)
+		return;
+
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
@@ -5396,18 +5406,70 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 }
 
+long long root_memcg_local_stat(unsigned int i, long long val,
+					long long nstat[])
+{
+	long long res = 0;
+
+	switch (i) {
+	case MEM_CGROUP_STAT_FILE_MAPPED:
+		res = global_page_state(NR_FILE_MAPPED);
+		break;
+	case MEM_CGROUP_STAT_FILE_DIRTY:
+		res = global_page_state(NR_FILE_DIRTY);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK:
+		res = global_page_state(NR_WRITEBACK);
+		break;
+	default:
+		break;
+	}
+
+	res = (res <= val) ? 0 : (res - val) * PAGE_SIZE;
+	nstat[i] = res;
+
+	return res;
+}
+
 static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 				 struct seq_file *m)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	struct mem_cgroup *mi;
 	unsigned int i;
+	long long nstat[MEM_CGROUP_STAT_NSTATS] = {0};
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+		long long val = 0, res = 0;
+
 		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
 			continue;
-		seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
-			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
+		if (i == MEM_CGROUP_STAT_SWAP || i == MEM_CGROUP_STAT_CACHE ||
+			i == MEM_CGROUP_STAT_RSS) {
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+				   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
+			continue;
+		}
+
+		/* As we don't account root memcg stats anymore, the
+		 * root_mem_cgroup->stat numbers are actually 0. But because of
+		 * hierachy, figures of root_mem_cgroup may just represent
+		 * numbers of pages used by its own tasks(not belonging to any
+		 * other child cgroup). So here we fake these root numbers by
+		 * using stats of global state and all other memcg. That is for
+		 * root memcg:
+		 * nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_
+		 * 	MAPPED) - sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED)
+		 * Dirty/Writeback pages accounting are in the similar way.
+		 */
+		if (memcg == root_mem_cgroup) {
+			for_each_mem_cgroup(mi)
+				val += mem_cgroup_read_stat(mi, i);
+			res = root_memcg_local_stat(i, val, nstat);
+		} else
+			res = mem_cgroup_read_stat(memcg, i) * PAGE_SIZE;
+
+		seq_printf(m, "%s %lld\n", mem_cgroup_stat_names[i], res);
 	}
 
 	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
@@ -5435,6 +5497,10 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 			continue;
 		for_each_mem_cgroup_tree(mi, memcg)
 			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
+
+		/* Adding local stats of root memcg */
+		if (memcg == root_mem_cgroup)
+			val += nstat[i];
 		seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
 	}
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
  2012-12-25 17:18 ` Sha Zhengju
  (?)
@ 2012-12-25 17:27   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:27 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This is inspired by a similar optimization from Glauber Costa
(memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
Here we use a jump label to patch the memcg page stat accounting code
in or out depending on use: when the first non-root memcg comes to
life the code is patched in, otherwise it stays patched out.
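
The mechanism, as a rough sketch (illustrative only; some_page_stat_hook()
is a made-up placeholder, the actual hooks are in the diff below): the key
starts out false, static_key_false() is a patched no-op branch while it
stays false, and static_key_slow_inc()/_dec() flip it at runtime:

	struct static_key memcg_in_use_key;	/* zero-initialized: false */

	void some_page_stat_hook(struct page *page)
	{
		/* compiled as a straight skip until the key is enabled */
		if (!static_key_false(&memcg_in_use_key))
			return;
		/* ... slow path: memcg page stat accounting ... */
	}

	/* when the first non-root memcg is created */
	static_key_slow_inc(&memcg_in_use_key);
	/* ... and when it goes away */
	static_key_slow_dec(&memcg_in_use_key);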

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    9 +++++++++
 mm/memcontrol.c            |    8 ++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1d22b81..3c4430c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+extern struct static_key memcg_in_use_key;
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	rcu_read_lock();
@@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	if (*locked)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0cb5187..a2f73d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -531,6 +531,8 @@ enum res_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
+struct static_key memcg_in_use_key;
+
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
 
@@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	unsigned long uninitialized_var(flags);
 
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		memcg->use_hierarchy = parent->use_hierarchy;
 		memcg->oom_kill_disable = parent->oom_kill_disable;
+
+		static_key_slow_inc(&memcg_in_use_key);
 	}
 
 	if (parent && parent->use_hierarchy) {
@@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
 	kmem_cgroup_destroy(memcg);
 
 	memcg_dangling_add(memcg);
+	static_key_slow_dec(&memcg_in_use_key);
 	mem_cgroup_put(memcg);
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
@ 2012-12-25 17:27   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:27 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This is inspired by a similar optimization from Glauber Costa
(memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
Here we use a jump label to patch the memcg page stat accounting code
in or out depending on use: when the first non-root memcg comes to
life the code is patched in, otherwise it stays patched out.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    9 +++++++++
 mm/memcontrol.c            |    8 ++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1d22b81..3c4430c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+extern struct static_key memcg_in_use_key;
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	rcu_read_lock();
@@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	if (*locked)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0cb5187..a2f73d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -531,6 +531,8 @@ enum res_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
+struct static_key memcg_in_use_key;
+
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
 
@@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	unsigned long uninitialized_var(flags);
 
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		memcg->use_hierarchy = parent->use_hierarchy;
 		memcg->oom_kill_disable = parent->oom_kill_disable;
+
+		static_key_slow_inc(&memcg_in_use_key);
 	}
 
 	if (parent && parent->use_hierarchy) {
@@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
 	kmem_cgroup_destroy(memcg);
 
 	memcg_dangling_add(memcg);
+	static_key_slow_dec(&memcg_in_use_key);
 	mem_cgroup_put(memcg);
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
@ 2012-12-25 17:27   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:27 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

This is inspired by a similar optimization from Glauber Costa
(memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
Here we use a jump label to patch the memcg page stat accounting code
in or out depending on use: when the first non-root memcg comes to
life the code is patched in, otherwise it stays patched out.

Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/memcontrol.h |    9 +++++++++
 mm/memcontrol.c            |    8 ++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1d22b81..3c4430c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+extern struct static_key memcg_in_use_key;
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	rcu_read_lock();
@@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 	if (*locked)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0cb5187..a2f73d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -531,6 +531,8 @@ enum res_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
+struct static_key memcg_in_use_key;
+
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
 
@@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	unsigned long uninitialized_var(flags);
 
+	if (!static_key_false(&memcg_in_use_key))
+		return;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 		parent = mem_cgroup_from_cont(cont->parent);
 		memcg->use_hierarchy = parent->use_hierarchy;
 		memcg->oom_kill_disable = parent->oom_kill_disable;
+
+		static_key_slow_inc(&memcg_in_use_key);
 	}
 
 	if (parent && parent->use_hierarchy) {
@@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
 	kmem_cgroup_destroy(memcg);
 
 	memcg_dangling_add(memcg);
+	static_key_slow_dec(&memcg_in_use_key);
 	mem_cgroup_put(memcg);
 }
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
  2012-12-25 17:18 ` Sha Zhengju
  (?)
@ 2012-12-25 17:28   ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:28 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..2828164 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty          - # of bytes of file cache that are not in sync with the disk copy.
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
@ 2012-12-25 17:28   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:28 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..2828164 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty          - # of bytes of file cache that are not in sync with the disk copy.
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
@ 2012-12-25 17:28   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:28 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..2828164 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty          - # of bytes of file cache that are not in sync with the disk copy.
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-25 17:22   ` Sha Zhengju
@ 2012-12-28  0:39     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:39 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko,
	akpm, gthelen, fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:22), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied. But it's better to make the two operations in one single
> function to keep modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> 
> Here is some test numbers that before/after this patch:
> Test steps(Mem-4g, ext4):
> drop_cache; sync
> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
> 
> We test it for 10 times and get the average numbers:
> Before:
> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
> After:
> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
> 
> Note that the impact is little(<1%).
> 
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm... this change should be double-checked by the vfs and I/O folks.

Doesn't increasing the hold time of mapping->private_lock affect performance?



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
@ 2012-12-28  0:39     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:39 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko,
	akpm, gthelen, fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:22), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied. But it's better to make the two operations in one single
> function to keep modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> 
> Here is some test numbers that before/after this patch:
> Test steps(Mem-4g, ext4):
> drop_cache; sync
> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
> 
> We test it for 10 times and get the average numbers:
> Before:
> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
> After:
> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
> 
> Note that the impact is little(<1%).
> 
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm... this change should be double-checked by the vfs and I/O folks.

Doesn't increasing the hold time of mapping->private_lock affect performance?



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
@ 2012-12-28  0:41     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:41 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel, sage,
	dchinner, mhocko, akpm, gthelen, fengguang.wu, glommer,
	Sha Zhengju

(2012/12/26 2:24), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: Sage Weil <sage@inktank.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
@ 2012-12-28  0:41     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:41 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

(2012/12/26 2:24), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> Acked-by: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
@ 2012-12-28  0:41     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:41 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel, sage,
	dchinner, mhocko, akpm, gthelen, fengguang.wu, glommer,
	Sha Zhengju

(2012/12/26 2:24), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: Sage Weil <sage@inktank.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
  2012-12-25 17:26   ` Sha Zhengju
@ 2012-12-28  0:52     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:52 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:26), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>          mem_cgroup_begin_update_page_stat()
>          modify page WRITEBACK stat
>          mem_cgroup_update_page_stat()
>          mem_cgroup_end_update_page_stat()
> 
> There're two writeback interface to modify: test_clear/set_page_writeback.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
@ 2012-12-28  0:52     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:52 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:26), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>          mem_cgroup_begin_update_page_stat()
>          modify page WRITEBACK stat
>          mem_cgroup_update_page_stat()
>          mem_cgroup_end_update_page_stat()
> 
> There're two writeback interface to modify: test_clear/set_page_writeback.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2012-12-25 17:27   ` Sha Zhengju
@ 2012-12-28  1:04     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:04 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> If memcg is enabled and no non-root memcg exists, all allocated pages
> belongs to root_mem_cgroup and go through root memcg statistics routines
> which brings some overheads. So for the sake of performance, we can give
> up accounting stats of root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
> /WRITEBACK and instead we pay special attention while showing root
> memcg numbers in memcg_stat_show(): as we don't account root memcg stats
> anymore, the root_mem_cgroup->stat numbers are actually 0. But because of
> hierachy, figures of root_mem_cgroup may just represent numbers of pages
> used by its own tasks(not belonging to any other child cgroup). So here we
> fake these root numbers by using stats of global state and all other memcg.
> That is for root memcg:
> 	nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
>                                sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
> Dirty/Writeback pages accounting are in the similar way.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Isn't it better to use the mem_cgroup_is_root() helper rather than
a direct comparison (memcg == root_mem_cgroup)?
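
Something like this, just as a sketch of the suggested form (assuming the
existing mem_cgroup_is_root() helper):

	if (mem_cgroup_is_root(memcg))
		return;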

Anyway, Ack to this approach. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
@ 2012-12-28  1:04     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:04 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> If memcg is enabled and no non-root memcg exists, all allocated pages
> belongs to root_mem_cgroup and go through root memcg statistics routines
> which brings some overheads. So for the sake of performance, we can give
> up accounting stats of root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
> /WRITEBACK and instead we pay special attention while showing root
> memcg numbers in memcg_stat_show(): as we don't account root memcg stats
> anymore, the root_mem_cgroup->stat numbers are actually 0. But because of
> hierachy, figures of root_mem_cgroup may just represent numbers of pages
> used by its own tasks(not belonging to any other child cgroup). So here we
> fake these root numbers by using stats of global state and all other memcg.
> That is for root memcg:
> 	nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
>                                sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
> Dirty/Writeback pages accounting are in the similar way.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Isn't it better to use the mem_cgroup_is_root() helper rather than
a direct comparison (memcg == root_mem_cgroup)?

Anyway, Ack to this approach. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
  2012-12-25 17:27   ` Sha Zhengju
  (?)
@ 2012-12-28  1:06     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:06 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> It's inspired by a similar optimization from Glauber Costa
> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
> Here we use jump label to patch the memcg page stat accounting code
> in or out when not used. when the first non-root memcg comes to
> life the code is patching in otherwise it is out.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
@ 2012-12-28  1:06     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:06 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> It's inspired by a similar optimization from Glauber Costa
> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
> Here we use jump label to patch the memcg page stat accounting code
> in or out when not used. when the first non-root memcg comes to
> life the code is patching in otherwise it is out.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
@ 2012-12-28  1:06     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:06 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> It's inspired by a similar optimization from Glauber Costa
> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
> Here we use jump label to patch the memcg page stat accounting code
> in or out when not used. when the first non-root memcg comes to
> life the code is patching in otherwise it is out.
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
  2012-12-25 17:28   ` Sha Zhengju
  (?)
@ 2012-12-28  1:10     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:10 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:28), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

I don't think your wording is bad, but it may be better to sync with meminfo's text.

> ---
>   Documentation/cgroups/memory.txt |    2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..2828164 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
>   pgpgout		- # of uncharging events to the memory cgroup. The uncharging
>   		event happens each time a page is unaccounted from the cgroup.
>   swap		- # of bytes of swap usage
> +dirty          - # of bytes of file cache that are not in sync with the disk copy.
> +writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
>   inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>   		LRU list.
>   active_anon	- # of bytes of anonymous and swap cache memory on active
> 

Documentation/filesystems/proc.txt

       Dirty: Memory which is waiting to get written back to the disk
   Writeback: Memory which is actively being written back to the disk

even if others are not ;(
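
e.g. the two new lines could then read something like this (just a sketch of
the suggested wording, not a final patch):

	dirty		- # of bytes of memory that are waiting to get written
			back to the disk.
	writeback	- # of bytes of memory that are actively being written
			back to the disk.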

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
@ 2012-12-28  1:10     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:10 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:28), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

I don't think your wording is bad, but it may be better to sync with meminfo's text.

> ---
>   Documentation/cgroups/memory.txt |    2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..2828164 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
>   pgpgout		- # of uncharging events to the memory cgroup. The uncharging
>   		event happens each time a page is unaccounted from the cgroup.
>   swap		- # of bytes of swap usage
> +dirty          - # of bytes of file cache that are not in sync with the disk copy.
> +writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
>   inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>   		LRU list.
>   active_anon	- # of bytes of anonymous and swap cache memory on active
> 

Documentation/filesystems/proc.txt

       Dirty: Memory which is waiting to get written back to the disk
   Writeback: Memory which is actively being written back to the disk

even if others are not ;(

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
@ 2012-12-28  1:10     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:10 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

(2012/12/26 2:28), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

I don't think your words are bad but it may be better to sync with meminfo's text.

> ---
>   Documentation/cgroups/memory.txt |    2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..2828164 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
>   pgpgout		- # of uncharging events to the memory cgroup. The uncharging
>   		event happens each time a page is unaccounted from the cgroup.
>   swap		- # of bytes of swap usage
> +dirty          - # of bytes of file cache that are not in sync with the disk copy.
> +writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
>   inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>   		LRU list.
>   active_anon	- # of bytes of anonymous and swap cache memory on active
> 

Documentation/filesystems/proc.txt

       Dirty: Memory which is waiting to get written back to the disk
   Writeback: Memory which is actively being written back to the disk

even if others are not ;(

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
  2012-12-25 17:27   ` Sha Zhengju
@ 2012-12-28  1:45     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  1:45 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:27), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> It's inspired by a similar optimization from Glauber Costa
> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
> Here we use jump label to patch the memcg page stat accounting code
> in or out when not used. when the first non-root memcg comes to
> life the code is patching in otherwise it is out.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>   include/linux/memcontrol.h |    9 +++++++++
>   mm/memcontrol.c            |    8 ++++++++
>   2 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1d22b81..3c4430c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
>   };
>   
>   #ifdef CONFIG_MEMCG
> +
> +extern struct static_key memcg_in_use_key;
> +
>   /*
>    * All "charge" functions with gfp_mask should use GFP_KERNEL or
>    * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
>   static inline void mem_cgroup_begin_update_page_stat(struct page *page,
>   					bool *locked, unsigned long *flags)
>   {
> +	if (!static_key_false(&memcg_in_use_key))
> +		return;
> +
>   	if (mem_cgroup_disabled())
>   		return;
>   	rcu_read_lock();
> @@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
>   static inline void mem_cgroup_end_update_page_stat(struct page *page,
>   					bool *locked, unsigned long *flags)
>   {
> +	if (!static_key_false(&memcg_in_use_key))
> +		return;
> +
>   	if (mem_cgroup_disabled())
>   		return;
>   	if (*locked)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0cb5187..a2f73d7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -531,6 +531,8 @@ enum res_type {
>   #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
>   #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
>   
> +struct static_key memcg_in_use_key;
> +
>   static void mem_cgroup_get(struct mem_cgroup *memcg);
>   static void mem_cgroup_put(struct mem_cgroup *memcg);
>   
> @@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
>   	struct page_cgroup *pc = lookup_page_cgroup(page);
>   	unsigned long uninitialized_var(flags);
>   
> +	if (!static_key_false(&memcg_in_use_key))
> +		return;
> +
>   	if (mem_cgroup_disabled())
>   		return;
>   
> @@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   		parent = mem_cgroup_from_cont(cont->parent);
>   		memcg->use_hierarchy = parent->use_hierarchy;
>   		memcg->oom_kill_disable = parent->oom_kill_disable;
> +
> +		static_key_slow_inc(&memcg_in_use_key);
>   	}
>   
>   	if (parent && parent->use_hierarchy) {
> @@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
>   	kmem_cgroup_destroy(memcg);
>   
>   	memcg_dangling_add(memcg);
> +	static_key_slow_dec(&memcg_in_use_key);

Hmm. I think this part should be folded into disarm_static_keys().
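
A rough sketch of what that folding could look like (illustrative only;
the other disarm_* calls stand in for whatever disarm_static_keys()
already does in this tree):

static void disarm_static_keys(struct mem_cgroup *memcg)
{
	/* existing key teardown stays as it is */
	disarm_sock_keys(memcg);
	disarm_kmem_keys(memcg);

	/* move the open-coded decrement out of mem_cgroup_css_free();
	 * only non-root groups ever incremented the key */
	if (!mem_cgroup_is_root(memcg))
		static_key_slow_dec(&memcg_in_use_key);
}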

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-25 17:22   ` Sha Zhengju
@ 2013-01-02  9:08     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02  9:08 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju

On Wed 26-12-12 01:22:36, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied. But it's better to make the two operations in one single
> function to keep modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> There's no deadlock between ->private_lock and ->tree_lock after confirmation.

Could you be more specific here? E.g. quote the mm/filemap.c comment I
mentioned during the first round of review?
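
For reference, the lock-ordering block at the top of mm/filemap.c being
referred to reads roughly as follows (the exact wording may differ
slightly between kernel versions):

 * Lock ordering:
 *
 *  ->i_mmap_mutex		(truncate_pagecache)
 *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
 *      ->swap_lock		(exclusive_swap_page, others)
 *        ->mapping->tree_lock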

> It's a prepare patch for following memcg dirty page accounting patches.
> 
> 
> Here is some test numbers that before/after this patch:
> Test steps(Mem-4g, ext4):
> drop_cache; sync
> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)

Could you also add some rationale for why you think this test is relevant?

> We test it for 10 times and get the average numbers:
> Before:
> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
> After:
> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
> 
> Note that the impact is little(<1%).
> 
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
>  fs/buffer.c |   24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index c017a2d..3b032b9 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static void __set_page_dirty(struct page *page,
> +static int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	if (TestSetPageDirty(page))
> +		return 0;
> +
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
> @@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page,
>  	}
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	return 1;
>  }
>  
>  /*
> @@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page)
>  			bh = bh->b_this_page;
>  		} while (bh != head);
>  	}
> -	newly_dirty = !TestSetPageDirty(page);
> +	newly_dirty = __set_page_dirty(page, mapping, 1);
>  	spin_unlock(&mapping->private_lock);
>  
> -	if (newly_dirty)
> -		__set_page_dirty(page, mapping, 1);
>  	return newly_dirty;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
> @@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  			return;
>  	}
>  
> -	if (!test_set_buffer_dirty(bh)) {
> -		struct page *page = bh->b_page;
> -		if (!TestSetPageDirty(page)) {
> -			struct address_space *mapping = page_mapping(page);
> -			if (mapping)
> -				__set_page_dirty(page, mapping, 0);
> -		}
> -	}
> +	if (!test_set_buffer_dirty(bh))
> +		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2012-12-25 17:26   ` Sha Zhengju
@ 2013-01-02 10:44     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02 10:44 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while cgroup's direct reclaim is working.

I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim but it is not clear to me
why the usefulness should be limited to reclaim for users. I would
understand this if the users were in fact in-kernel users.

[...]
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
> a prepare PageDirty() checking is added.

But there is another AA deadlock here I believe.
page_remove_rmap
  mem_cgroup_begin_update_page_stat		<<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat	<<< 2
	  move_lock_mem_cgroup
	    spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
	CPU0						CPU1					
page_remove_rmap
						mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
						  mem_cgroup_start_move
						    atomic_inc(&memcg_moving)
						    atomic_inc(&memcg->moving_account)
						    synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen	<<< TRUE
      move_lock_mem_cgroup
  [...]
        mem_cgroup_begin_update_page_stat (2)
	  __mem_cgroup_begin_update_page_stat
	    mem_cgroup_stolen	  <<< still TRUE
	    move_lock_mem_cgroup  <<< DEADLOCK
  [...]
  mem_cgroup_end_update_page_stat
    rcu_unlock
    						  # wake up from synchronize_rcu
						[...]
						mem_cgroup_move_task
						  mem_cgroup_move_charge
						    walk_page_range
						      mem_cgroup_move_account
						        move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening, but the locking relations are really complicated in this area,
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.
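
To make the non-recursive part concrete, here is a stripped-down sketch
(not the actual kernel code) of the slow path taken while charges are
being moved:

static void begin_update_page_stat_sketch(struct mem_cgroup *memcg,
					  bool *locked, unsigned long *flags)
{
	/* mem_cgroup_stolen() stays true for the whole move window */
	if (!mem_cgroup_stolen(memcg))
		return;				/* fast path, no lock taken */

	/* move_lock is an ordinary, non-recursive spinlock, so the
	 * nested call in the trace above spins on it forever (AA) */
	spin_lock_irqsave(&memcg->move_lock, *flags);
	*locked = true;
}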

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
  2012-12-25 17:26   ` Sha Zhengju
  (?)
@ 2013-01-02 11:15     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02 11:15 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Wed 26-12-12 01:26:48, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
> 
> There're two writeback interface to modify: test_clear/set_page_writeback.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Looks good to me
Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   17 +++++++++++++++++
>  3 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 358019e..1d22b81 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -45,6 +45,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21df36d..13cd14a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -96,6 +96,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -3676,6 +3677,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 526ddd7..ae6498a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2002,11 +2002,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting writeback pages.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2242,7 +2248,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2263,9 +2272,12 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2273,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2300,6 +2315,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2012-12-25 17:27   ` Sha Zhengju
  (?)
@ 2013-01-02 12:27     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02 12:27 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Wed 26-12-12 01:27:27, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> If memcg is enabled and no non-root memcg exists, all allocated pages
> belongs to root_mem_cgroup and go through root memcg statistics routines
> which brings some overheads. So for the sake of performance, we can give
> up accounting stats of root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
> /WRITEBACK

I do not like this selective approach. We should handle all the stat
types in the same way. SWAP is not a hot path but RSS and CACHE should
be optimized as well. It seems that threshold events might be a
complication here but it shouldn't be that big a deal (mem_cgroup_usage
would need some treatment).

> and instead we pay special attention while showing root
> memcg numbers in memcg_stat_show(): as we don't account root memcg stats
> anymore, the root_mem_cgroup->stat numbers are actually 0.

Yes, this is reasonable.

> But because of hierachy, figures of root_mem_cgroup may just represent
> numbers of pages used by its own tasks(not belonging to any other
> child cgroup).

I am not sure what the above means. root might have use_hierarchy set to
1 as well.

> So here we fake these root numbers by using stats of global state and
> all other memcg.  That is for root memcg:
> 	nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
>                               sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
> Dirty/Writeback pages accounting are in the similar way.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

I like the approach but I do not like the implementation. See details
below.

> ---
>  mm/memcontrol.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fc20ac9..728349d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -5396,18 +5406,70 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
>  	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
>  }
>  
> +long long root_memcg_local_stat(unsigned int i, long long val,
> +					long long nstat[])

The function should be static. Also, the nstat parameter is ugly
because this can be done by the caller, and expecting that the caller
has already calculated val is not nice (and undocumented). This
approach is really hackish and error prone. Why should we define a
specific function rather than hooking into mem_cgroup_read_stat and
doing all the stuff there? I think that would be much more
maintainable.
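
A sketch of how such a hook could look; root_stat_is_derived() and
root_memcg_global_stat() are hypothetical helpers, the latter standing
for the per-stat-type global_page_state() lookup done above:

static long long mem_cgroup_read_stat(struct mem_cgroup *memcg,
				      enum mem_cgroup_stat_index idx)
{
	long long val = 0;
	int cpu;

	/* root is no longer accounted directly: derive its local numbers
	 * from the global counter minus the sum over all other memcgs */
	if (memcg == root_mem_cgroup && root_stat_is_derived(idx)) {
		struct mem_cgroup *iter;

		val = root_memcg_global_stat(idx);
		for_each_mem_cgroup(iter)
			if (iter != root_mem_cgroup)
				val -= mem_cgroup_read_stat(iter, idx);
		return val > 0 ? val : 0;
	}

	for_each_possible_cpu(cpu)
		val += per_cpu(memcg->stat->count[idx], cpu);
	return val;
}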

> +{
> +	long long res = 0;
> +
> +	switch (i) {
> +	case MEM_CGROUP_STAT_FILE_MAPPED:
> +		res = global_page_state(NR_FILE_MAPPED);
> +		break;
> +	case MEM_CGROUP_STAT_FILE_DIRTY:
> +		res = global_page_state(NR_FILE_DIRTY);
> +		break;
> +	case MEM_CGROUP_STAT_WRITEBACK:
> +		res = global_page_state(NR_WRITEBACK);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	res = (res <= val) ? 0 : (res - val) * PAGE_SIZE;
> +	nstat[i] = res;
> +
> +	return res;
> +}
> +
>  static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>  				 struct seq_file *m)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>  	struct mem_cgroup *mi;
>  	unsigned int i;
> +	long long nstat[MEM_CGROUP_STAT_NSTATS] = {0};

s/nstat/root_stat/

>  
>  	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> +		long long val = 0, res = 0;
> +
>  		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>  			continue;
> -		seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
> -			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
> +		if (i == MEM_CGROUP_STAT_SWAP || i == MEM_CGROUP_STAT_CACHE ||
> +			i == MEM_CGROUP_STAT_RSS) {

This is plain ugly. If nothing else it asks for a comment explaining why
those are special.

> +			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
> +				   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
> +			continue;
> +		}
> +
> +		/* As we don't account root memcg stats anymore, the
> +		 * root_mem_cgroup->stat numbers are actually 0. But because of
> +		 * hierachy, figures of root_mem_cgroup may just represent
> +		 * numbers of pages used by its own tasks(not belonging to any
> +		 * other child cgroup). So here we fake these root numbers by
> +		 * using stats of global state and all other memcg. That is for
> +		 * root memcg:
> +		 * nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_
> +		 * 	MAPPED) - sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED)
> +		 * Dirty/Writeback pages accounting are in the similar way.
> +		 */
> +		if (memcg == root_mem_cgroup) {
> +			for_each_mem_cgroup(mi)
> +				val += mem_cgroup_read_stat(mi, i);
> +			res = root_memcg_local_stat(i, val, nstat);
> +		} else
> +			res = mem_cgroup_read_stat(memcg, i) * PAGE_SIZE;
> +
> +		seq_printf(m, "%s %lld\n", mem_cgroup_stat_names[i], res);
>  	}
>  
>  	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> @@ -5435,6 +5497,10 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>  			continue;
>  		for_each_mem_cgroup_tree(mi, memcg)
>  			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
> +
> +		/* Adding local stats of root memcg */
> +		if (memcg == root_mem_cgroup)
> +			val += nstat[i];
>  		seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
>  	}
>  
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
  2012-12-25 17:27   ` Sha Zhengju
@ 2013-01-02 13:35     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02 13:35 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju, Mel Gorman

[CCing Mel]

On Wed 26-12-12 01:27:57, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> It's inspired by a similar optimization from Glauber Costa
> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
> Here we use jump label to patch the memcg page stat accounting code
> in or out when not used. when the first non-root memcg comes to
> life the code is patching in otherwise it is out.

Mel had a workload which showed quite a big regression when memcg is
enabled with no cgroups but root (it was a page fault microbench AFAIR
but I do not have a link handy), so it would be nice to check how much
this patch helps and what other places could benefit from
the static key.

> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Anyway, I like this as a first step (and the patch description should be
explicit about that). See other comments below.

> ---
>  include/linux/memcontrol.h |    9 +++++++++
>  mm/memcontrol.c            |    8 ++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1d22b81..3c4430c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +
> +extern struct static_key memcg_in_use_key;
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
>  					bool *locked, unsigned long *flags)
>  {
> +	if (!static_key_false(&memcg_in_use_key))
> +		return;

Maybe static_key checks could be wrapped by a helper function with a
more obvious name (mem_cgroup_in_use())?
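
Something along these lines, perhaps (just a sketch, using the name suggested
above):

static inline bool mem_cgroup_in_use(void)
{
	/*
	 * False (and compiled out via the jump label) until the first
	 * non-root memcg arms memcg_in_use_key.
	 */
	return static_key_false(&memcg_in_use_key);
}

so that the callers simply do

	if (!mem_cgroup_in_use())
		return;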

> +
>  	if (mem_cgroup_disabled())
>  		return;

I would assume the check ordering would be vice versa.

>  	rcu_read_lock();
> @@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  					bool *locked, unsigned long *flags)
>  {
> +	if (!static_key_false(&memcg_in_use_key))
> +		return;
> +
>  	if (mem_cgroup_disabled())
>  		return;

ditto

>  	if (*locked)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0cb5187..a2f73d7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -531,6 +531,8 @@ enum res_type {
>  #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
>  #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
>  

/*
 * TODO comment what it is used for, please
 */
> +struct static_key memcg_in_use_key;
> +
>  static void mem_cgroup_get(struct mem_cgroup *memcg);
>  static void mem_cgroup_put(struct mem_cgroup *memcg);
>  
> @@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  	unsigned long uninitialized_var(flags);

This artifact can be removed.

> +	if (!static_key_false(&memcg_in_use_key))
> +		return;
> +
>  	if (mem_cgroup_disabled())
>  		return;
>  
> @@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		memcg->use_hierarchy = parent->use_hierarchy;
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
> +
> +		static_key_slow_inc(&memcg_in_use_key);

Please wrap this into a function because later we will probably want to
do an action depending on whether this is a first onlined group (e.g.
sync stats etc...).
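
For example (hypothetical name, only to illustrate the shape of the wrapper):

static void memcg_arm_page_stat_key(void)
{
	/*
	 * Room for extra work later, e.g. syncing stats when the first
	 * non-root group comes online.
	 */
	static_key_slow_inc(&memcg_in_use_key);
}

and mem_cgroup_css_alloc() would call this instead of open-coding the
static_key_slow_inc().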

>  	}
>  
>  	if (parent && parent->use_hierarchy) {
> @@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
>  	kmem_cgroup_destroy(memcg);
>  
>  	memcg_dangling_add(memcg);
> +	static_key_slow_dec(&memcg_in_use_key);
>  	mem_cgroup_put(memcg);

memcg could be still alive at this moment (e.g. due to swap or kmem
charges). This is not a big issue with the current state of the patch
set as you ignore MEM_CGROUP_STAT_SWAP now but you shouldn't rely on
that. We also have a proper place for it already (disarm_static_keys).
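
I.e. move the decrement into a small helper invoked from disarm_static_keys()
(hypothetical name, just a sketch):

static void disarm_page_stat_key(struct mem_cgroup *memcg)
{
	if (!mem_cgroup_is_root(memcg))
		static_key_slow_dec(&memcg_in_use_key);
}

so the key is only dropped once the memcg is really going away.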

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
  2012-12-25 17:28   ` Sha Zhengju
  (?)
@ 2013-01-02 13:36     ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-02 13:36 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Wed 26-12-12 01:28:21, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  Documentation/cgroups/memory.txt |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index addb1f1..2828164 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -487,6 +487,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
>  pgpgout		- # of uncharging events to the memory cgroup. The uncharging
>  		event happens each time a page is unaccounted from the cgroup.
>  swap		- # of bytes of swap usage
> +dirty          - # of bytes of file cache that are not in sync with the disk copy.
> +writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-28  0:39     ` Kamezawa Hiroyuki
@ 2013-01-05  2:34       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05  2:34 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko,
	akpm, gthelen, fengguang.wu, glommer, Sha Zhengju

Hi Kame,

Sorry for the late response, I'm just back from vacation. : )

On Fri, Dec 28, 2012 at 8:39 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/12/26 2:22), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
>> extracts TestSetPageDirty from __set_page_dirty and is far away from
>> account_page_dirtied. But it's better to make the two operations in one single
>> function to keep modular. So in order to avoid the potential race mentioned in
>> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
>> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
>> It's a prepare patch for following memcg dirty page accounting patches.
>>
>>
>> Here is some test numbers that before/after this patch:
>> Test steps(Mem-4g, ext4):
>> drop_cache; sync
>> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
>>
>> We test it for 10 times and get the average numbers:
>> Before:
>> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
>> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
>> After:
>> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
>> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
>>
>> Note that the impact is little(<1%).
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Reviewed-by: Michal Hocko <mhocko@suse.cz>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Hmm,..this change should be double-checked by vfs, I/O guys...
>

It seems they haven't paid attention to it yet... I'll push it again soon
to get more review.

> increasing hold time of mapping->private_lock doesn't affect performance ?
>
>

Yes, as Fengguang pointed out in the previous round, mapping->private_lock
and mapping->tree_lock are often contended locks; in a dd testcase they show
the top #1 and #2 contention.
So the numbers above try to measure the impact of lock contention with
multiple threads (numjobs=2) writing to the same file in parallel, and the
impact seems to be small (<1%).
I'm not sure if the test case is enough, any advice is welcomed! : )

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
@ 2013-01-05  2:49       ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05  2:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju

Hi Michal,

Sorry for my late response, I'm just back from vacation. : )

On Wed, Jan 2, 2013 at 5:08 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 26-12-12 01:22:36, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
>> extracts TestSetPageDirty from __set_page_dirty and is far away from
>> account_page_dirtied. But it's better to make the two operations in one single
>> function to keep modular. So in order to avoid the potential race mentioned in
>> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
>> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
>
> Could you be more specific here? E.g. quote mm/filemap.c comment I have
> mentioned during the first round of review?
>

Okay, sorry for forgetting the comment. I'll add it next round.

>> It's a prepare patch for following memcg dirty page accounting patches.
>>
>>
>> Here is some test numbers that before/after this patch:
>> Test steps(Mem-4g, ext4):
>> drop_cache; sync
>> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
>
> Could also add some rationale why you think this test is relevant?
>

The test aims at measuring the performance impact of lock contention when
writing to the same file in parallel. I'll add the rationale in the next
version too.

Thanks for reviewing!


Regards,
Sha

>> We test it for 10 times and get the average numbers:
>> Before:
>> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
>> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
>> After:
>> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
>> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
>>
>> Note that the impact is little(<1%).
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Reviewed-by: Michal Hocko <mhocko@suse.cz>
>> ---
>>  fs/buffer.c |   24 ++++++++++++------------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index c017a2d..3b032b9 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>   * If warn is true, then emit a warning if the page is not uptodate and has
>>   * not been truncated.
>>   */
>> -static void __set_page_dirty(struct page *page,
>> +static int __set_page_dirty(struct page *page,
>>               struct address_space *mapping, int warn)
>>  {
>> +     if (unlikely(!mapping))
>> +             return !TestSetPageDirty(page);
>> +
>> +     if (TestSetPageDirty(page))
>> +             return 0;
>> +
>>       spin_lock_irq(&mapping->tree_lock);
>>       if (page->mapping) {    /* Race with truncate? */
>>               WARN_ON_ONCE(warn && !PageUptodate(page));
>> @@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page,
>>       }
>>       spin_unlock_irq(&mapping->tree_lock);
>>       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> +
>> +     return 1;
>>  }
>>
>>  /*
>> @@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page)
>>                       bh = bh->b_this_page;
>>               } while (bh != head);
>>       }
>> -     newly_dirty = !TestSetPageDirty(page);
>> +     newly_dirty = __set_page_dirty(page, mapping, 1);
>>       spin_unlock(&mapping->private_lock);
>>
>> -     if (newly_dirty)
>> -             __set_page_dirty(page, mapping, 1);
>>       return newly_dirty;
>>  }
>>  EXPORT_SYMBOL(__set_page_dirty_buffers);
>> @@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh)
>>                       return;
>>       }
>>
>> -     if (!test_set_buffer_dirty(bh)) {
>> -             struct page *page = bh->b_page;
>> -             if (!TestSetPageDirty(page)) {
>> -                     struct address_space *mapping = page_mapping(page);
>> -                     if (mapping)
>> -                             __set_page_dirty(page, mapping, 0);
>> -             }
>> -     }
>> +     if (!test_set_buffer_dirty(bh))
>> +             __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
>>  }
>>  EXPORT_SYMBOL(mark_buffer_dirty);
>>
>> --
>> 1.7.9.5
>>
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-02 10:44     ` Michal Hocko
@ 2013-01-05  4:48       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05  4:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory and can provide some
>> info for users while cgroup's direct reclaim is working.
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft reclaim here which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).

> why the usefulnes should be limitted to the reclaim for users. I would
> understand this if the users was in fact in-kernel users.
>

One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft reclaim
may affect the response time of user applications.
Yeah, the beneficiary can be application administrators or kernel users.  :P

> [...]
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>> a prepare PageDirty() checking is added.
>
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat             <<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat       <<< 2
>           move_lock_mem_cgroup
>             spin_lock_irqsave(&memcg->move_lock, *flags);
>
> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
> because we might race with the moving charges:
>         CPU0                                            CPU1
> page_remove_rmap
>                                                 mem_cgroup_can_attach
>   mem_cgroup_begin_update_page_stat (1)
>     rcu_read_lock
>                                                   mem_cgroup_start_move
>                                                     atomic_inc(&memcg_moving)
>                                                     atomic_inc(&memcg->moving_account)
>                                                     synchronize_rcu
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< TRUE
>       move_lock_mem_cgroup
>   [...]
>         mem_cgroup_begin_update_page_stat (2)
>           __mem_cgroup_begin_update_page_stat
>             mem_cgroup_stolen     <<< still TRUE
>             move_lock_mem_cgroup  <<< DEADLOCK
>   [...]
>   mem_cgroup_end_update_page_stat
>     rcu_unlock
>                                                   # wake up from synchronize_rcu
>                                                 [...]
>                                                 mem_cgroup_move_task
>                                                   mem_cgroup_move_charge
>                                                     walk_page_range
>                                                       mem_cgroup_move_account
>                                                         move_lock_mem_cgroup
>
>
> Maybe I have missed some other locking which would prevent this from
> happening but the locking relations are really complicated in this area
> so if mem_cgroup_{begin,end}_update_page_stat might be called
> recursively then we need a fat comment which justifies that.
>

Ohhh... good catch!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat() in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has been bothering
me a lot recently, as the lock granularity is a little bigger than I thought:
not only the data but also some code logic sits inside the locked region,
which is deadlock prone. The problem will come up again if we try to add
stat accounting for other memcg pages later, so may I be so bold as to
suggest that we dig into the locking again...

But with regard to the current lock implementation, I doubt we can account
MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just take move_lock
once at the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
thinking of now is changing memcg->move_lock from the original spinlock to
an rw-spinlock:
mem_cgroup_{begin,end}_update_page_stat() takes the read lock, which makes
it reentrant, while the memcg task-moving side takes the write lock.
Then the race may look like the following:

        CPU0                                            CPU1
page_remove_rmap
                                                mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                                  mem_cgroup_start_move
                                                    atomic_inc(&memcg_moving)
                                                    atomic_inc(&memcg->moving_account)
                                                    synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen     <<< TRUE
      move_lock_mem_cgroup  <<< read-spinlock success
  [...]
    mem_cgroup_begin_update_page_stat (2)
      __mem_cgroup_begin_update_page_stat
        mem_cgroup_stolen     <<< still TRUE
        move_lock_mem_cgroup  <<< read-spinlock success
  [...]
  mem_cgroup_end_update_page_stat     <<< locked = true, unlock
    rcu_unlock
                                                  # wake up from synchronize_rcu
                                                [...]
                                                mem_cgroup_move_task
                                                  mem_cgroup_move_charge
                                                    walk_page_range
                                                      mem_cgroup_move_account
                                                        move_lock_mem_cgroup  <<< write-spinlock


AFAICS, the deadlock seems to be avoided by the combination of RCU and the
rwlock. Is there anything I missed?
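
To make it concrete, here is a minimal sketch of what I have in mind
(hypothetical helper names; the point is only the read/write split on
move_lock, which becomes an rwlock_t in struct mem_cgroup):

static void move_read_lock_mem_cgroup(struct mem_cgroup *memcg,
				      unsigned long *flags)
{
	/* page stat updaters: may nest, readers never block each other */
	read_lock_irqsave(&memcg->move_lock, *flags);
}

static void move_read_unlock_mem_cgroup(struct mem_cgroup *memcg,
					unsigned long *flags)
{
	read_unlock_irqrestore(&memcg->move_lock, *flags);
}

static void move_write_lock_mem_cgroup(struct mem_cgroup *memcg,
				       unsigned long *flags)
{
	/* the charge-moving side excludes all stat updaters */
	write_lock_irqsave(&memcg->move_lock, *flags);
}

mem_cgroup_{begin,end}_update_page_stat() would use the read variants and
mem_cgroup_move_account() the write one; since rwlock_t readers are not
blocked by a waiting writer, the nested read_lock above cannot deadlock.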


Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2012-12-28  1:04     ` Kamezawa Hiroyuki
  (?)
@ 2013-01-05  7:38       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05  7:38 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Fri, Dec 28, 2012 at 9:04 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/12/26 2:27), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> If memcg is enabled and no non-root memcg exists, all allocated pages
>> belongs to root_mem_cgroup and go through root memcg statistics routines
>> which brings some overheads. So for the sake of performance, we can give
>> up accounting stats of root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
>> /WRITEBACK and instead we pay special attention while showing root
>> memcg numbers in memcg_stat_show(): as we don't account root memcg stats
>> anymore, the root_mem_cgroup->stat numbers are actually 0. But because of
>> hierachy, figures of root_mem_cgroup may just represent numbers of pages
>> used by its own tasks(not belonging to any other child cgroup). So here we
>> fake these root numbers by using stats of global state and all other memcg.
>> That is for root memcg:
>>       nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
>>                                sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
>> Dirty/Writeback pages accounting are in the similar way.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> isn't it better to use mem_cgroup_is_root() call rather than
> direct comparison (memcg == root_mem_cgroup) ?
>

Okay, it's better to use the wrapper.
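
(For reference, the wrapper should just hide the comparison, something like:

static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
	return (memcg == root_mem_cgroup);
}

so it's purely a readability change.)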

> Anyway, Ack to this approach.
>

Thanks for reviewing!


Regards,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
@ 2013-01-05 10:45         ` Michal Hocko
  0 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-05 10:45 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju

On Sat 05-01-13 10:49:00, Sha Zhengju wrote:
[...]
> >> Here is some test numbers that before/after this patch:
> >> Test steps(Mem-4g, ext4):
> >> drop_cache; sync
> >> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
> >
> > Could also add some rationale why you think this test is relevant?
> >
> 
> The test aims at measuring the performance impact of lock contention when
> writing to the same file in parallel. I'll add the rationale in the next
> version too.

Please make sure to describe which locks are exercised during that test
and how much.

Thanks
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2013-01-02 12:27     ` Michal Hocko
  (?)
@ 2013-01-05 10:52       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05 10:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Wed, Jan 2, 2013 at 8:27 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 26-12-12 01:27:27, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> If memcg is enabled and no non-root memcg exists, all allocated pages
>> belongs to root_mem_cgroup and go through root memcg statistics routines
>> which brings some overheads. So for the sake of performance, we can give
>> up accounting stats of root memcg for MEM_CGROUP_STAT_FILE_MAPPED/FILE_DIRTY
>> /WRITEBACK
>
> I do not like this selective approach. We should handle all the stat
> types in the same way. SWAP is not a hot path but RSS and CACHE should
> be optimized as well. It seems that thresholds events might be a
> complication here but it shouldn't be that a big deal (mem_cgroup_usage
> would need some treat).
>

The three MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY, WRITEBACK} stats
are accounted in a similar way: all of them are embedded in the original global
accounting routines and share mem_cgroup_update_page_stat() as the unified
accounting entry. But RSS/CACHE/SWAP are counted on the memcg charge/uncharge
paths, and covering them would make the patch considerably larger... so I just
wanted to send this out first to collect opinions.
Given the feedback, I think I can handle the remaining stats in the next round.
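
For illustration, such a transition looks roughly like this (only a sketch,
not code from the series; the wrapper name is made up, while the begin/end
helpers, mem_cgroup_update_page_stat() and MEM_CGROUP_STAT_FILE_DIRTY are the
ones used by these patches):

#include <linux/memcontrol.h>
#include <linux/page-flags.h>

/* Sketch only: a page-state transition is bracketed by the begin/end
 * helpers and funnels into mem_cgroup_update_page_stat(). */
static void memcg_account_dirty_sketch(struct page *page)
{
	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page))
		mem_cgroup_update_page_stat(page,
					    MEM_CGROUP_STAT_FILE_DIRTY, 1);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);
}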


>> and instead we pay special attention while showing root
>> memcg numbers in memcg_stat_show(): as we don't account root memcg stats
>> anymore, the root_mem_cgroup->stat numbers are actually 0.
>
> Yes, this is reasonable.
>
>> But because of hierachy, figures of root_mem_cgroup may just represent
>> numbers of pages used by its own tasks(not belonging to any other
>> child cgroup).
>
> I am not sure what the above means. root might have use_hierarchy set to
> 1 as well.
>

Yes, my fault for not being clear here. What I mean is that we cannot simply
set the root memcg numbers to the global state. Whether root.use_hierarchy is
1 or 0, the figures of the root memcg represent its local page stats (the
difference only shows up in the total_* stats).
So we can fake these root numbers by using the global state together with the
stats of all other memcgs.

>> So here we fake these root numbers by using stats of global state and
>> all other memcg.  That is for root memcg:
>>       nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_MAPPED) -
>>                               sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED);
>> Dirty/Writeback pages accounting are in the similar way.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> I like the approach but I do not like the implementation. See details
> bellow.
>
>> ---
>>  mm/memcontrol.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 68 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index fc20ac9..728349d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> @@ -5396,18 +5406,70 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
>>       BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
>>  }
>>
>> +long long root_memcg_local_stat(unsigned int i, long long val,
>> +                                     long long nstat[])
>
> Function should be static
> also
> nstat parameter is ugly because this can be done by the caller
> and also expecting that the caller already calculated val is not
> nice (and undocumented). This approach is really hackish and error
> prone. Why should we define a specific function rather than hooking into
> mem_cgroup_read_stat and doing all the stuff there? I think that would
> be much more maintainable.
>

IMHO, hooking into mem_cgroup_read_stat may also be improper because
of the for_each_mem_cgroup traversal; I'd prefer to keep mem_cgroup_read_stat
as the basic unit. But I'll rework the function based on your opinion in the
next version.  Thanks for the advice!

>> +{
>> +     long long res = 0;
>> +
>> +     switch (i) {
>> +     case MEM_CGROUP_STAT_FILE_MAPPED:
>> +             res = global_page_state(NR_FILE_MAPPED);
>> +             break;
>> +     case MEM_CGROUP_STAT_FILE_DIRTY:
>> +             res = global_page_state(NR_FILE_DIRTY);
>> +             break;
>> +     case MEM_CGROUP_STAT_WRITEBACK:
>> +             res = global_page_state(NR_WRITEBACK);
>> +             break;
>> +     default:
>> +             break;
>> +     }
>> +
>> +     res = (res <= val) ? 0 : (res - val) * PAGE_SIZE;
>> +     nstat[i] = res;
>> +
>> +     return res;
>> +}
>> +
>>  static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>>                                struct seq_file *m)
>>  {
>>       struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>>       struct mem_cgroup *mi;
>>       unsigned int i;
>> +     long long nstat[MEM_CGROUP_STAT_NSTATS] = {0};
>
> s/nstat/root_stat/
>
>>
>>       for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>> +             long long val = 0, res = 0;
>> +
>>               if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>>                       continue;
>> -             seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
>> -                        mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
>> +             if (i == MEM_CGROUP_STAT_SWAP || i == MEM_CGROUP_STAT_CACHE ||
>> +                     i == MEM_CGROUP_STAT_RSS) {
>
> This is plain ugly. If nothing else it asks for a comment why those are
> special.

Okay, more comments will be added.


Thanks,
Sha

>
>> +                     seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
>> +                                mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
>> +                     continue;
>> +             }
>> +
>> +             /* As we don't account root memcg stats anymore, the
>> +              * root_mem_cgroup->stat numbers are actually 0. But because of
>> +              * hierachy, figures of root_mem_cgroup may just represent
>> +              * numbers of pages used by its own tasks(not belonging to any
>> +              * other child cgroup). So here we fake these root numbers by
>> +              * using stats of global state and all other memcg. That is for
>> +              * root memcg:
>> +              * nr(MEM_CGROUP_STAT_FILE_MAPPED) = global_page_state(NR_FILE_
>> +              *      MAPPED) - sum_of_all_memcg(MEM_CGROUP_STAT_FILE_MAPPED)
>> +              * Dirty/Writeback pages accounting are in the similar way.
>> +              */
>> +             if (memcg == root_mem_cgroup) {
>> +                     for_each_mem_cgroup(mi)
>> +                             val += mem_cgroup_read_stat(mi, i);
>> +                     res = root_memcg_local_stat(i, val, nstat);
>> +             } else
>> +                     res = mem_cgroup_read_stat(memcg, i) * PAGE_SIZE;
>> +
>> +             seq_printf(m, "%s %lld\n", mem_cgroup_stat_names[i], res);
>>       }
>>
>>       for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
>> @@ -5435,6 +5497,10 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
>>                       continue;
>>               for_each_mem_cgroup_tree(mi, memcg)
>>                       val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
>> +
>> +             /* Adding local stats of root memcg */
>> +             if (memcg == root_mem_cgroup)
>> +                     val += nstat[i];
>>               seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
>>       }
>>
>> --
>> 1.7.9.5
>>
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use
  2012-12-28  1:45     ` Kamezawa Hiroyuki
@ 2013-01-05 11:06       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-05 11:06 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Fri, Dec 28, 2012 at 9:45 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/12/26 2:27), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> It's inspired by a similar optimization from Glauber Costa
>> (memcg: make it suck faster; https://lkml.org/lkml/2012/9/25/154).
>> Here we use jump label to patch the memcg page stat accounting code
>> in or out when not used. when the first non-root memcg comes to
>> life the code is patching in otherwise it is out.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>   include/linux/memcontrol.h |    9 +++++++++
>>   mm/memcontrol.c            |    8 ++++++++
>>   2 files changed, 17 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 1d22b81..3c4430c 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -56,6 +56,9 @@ struct mem_cgroup_reclaim_cookie {
>>   };
>>
>>   #ifdef CONFIG_MEMCG
>> +
>> +extern struct static_key memcg_in_use_key;
>> +
>>   /*
>>    * All "charge" functions with gfp_mask should use GFP_KERNEL or
>>    * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
>> @@ -158,6 +161,9 @@ extern atomic_t memcg_moving;
>>   static inline void mem_cgroup_begin_update_page_stat(struct page *page,
>>                                       bool *locked, unsigned long *flags)
>>   {
>> +     if (!static_key_false(&memcg_in_use_key))
>> +             return;
>> +
>>       if (mem_cgroup_disabled())
>>               return;
>>       rcu_read_lock();
>> @@ -171,6 +177,9 @@ void __mem_cgroup_end_update_page_stat(struct page *page,
>>   static inline void mem_cgroup_end_update_page_stat(struct page *page,
>>                                       bool *locked, unsigned long *flags)
>>   {
>> +     if (!static_key_false(&memcg_in_use_key))
>> +             return;
>> +
>>       if (mem_cgroup_disabled())
>>               return;
>>       if (*locked)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 0cb5187..a2f73d7 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -531,6 +531,8 @@ enum res_type {
>>   #define MEM_CGROUP_RECLAIM_SHRINK_BIT       0x1
>>   #define MEM_CGROUP_RECLAIM_SHRINK   (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
>>
>> +struct static_key memcg_in_use_key;
>> +
>>   static void mem_cgroup_get(struct mem_cgroup *memcg);
>>   static void mem_cgroup_put(struct mem_cgroup *memcg);
>>
>> @@ -2226,6 +2228,9 @@ void mem_cgroup_update_page_stat(struct page *page,
>>       struct page_cgroup *pc = lookup_page_cgroup(page);
>>       unsigned long uninitialized_var(flags);
>>
>> +     if (!static_key_false(&memcg_in_use_key))
>> +             return;
>> +
>>       if (mem_cgroup_disabled())
>>               return;
>>
>> @@ -6340,6 +6345,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>>               parent = mem_cgroup_from_cont(cont->parent);
>>               memcg->use_hierarchy = parent->use_hierarchy;
>>               memcg->oom_kill_disable = parent->oom_kill_disable;
>> +
>> +             static_key_slow_inc(&memcg_in_use_key);
>>       }
>>
>>       if (parent && parent->use_hierarchy) {
>> @@ -6407,6 +6414,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
>>       kmem_cgroup_destroy(memcg);
>>
>>       memcg_dangling_add(memcg);
>> +     static_key_slow_dec(&memcg_in_use_key);
>
> Hmm. I think this part should be folded into disarm_static_keys().
>

Yeah, now I think that is a more suitable place for it.
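
Folded in, it might look roughly like this (just a sketch, not code from the
series; it assumes disarm_static_keys() keeps its existing sock/kmem teardown
in this tree):

/* Sketch only: move the jump-label teardown here instead of calling
 * static_key_slow_dec() directly from mem_cgroup_css_free(). */
static void disarm_static_keys(struct mem_cgroup *memcg)
{
	disarm_sock_keys(memcg);	/* existing teardown, assumed unchanged */
	disarm_kmem_keys(memcg);	/* existing teardown, assumed unchanged */

	/* undo the static_key_slow_inc() done when this memcg was created */
	static_key_slow_dec(&memcg_in_use_key);
}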


Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics
  2012-12-28  1:10     ` Kamezawa Hiroyuki
  (?)
@ 2013-01-06  2:55       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-06  2:55 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Fri, Dec 28, 2012 at 9:10 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/12/26 2:28), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> I don't think your words are bad but it may be better to sync with meminfo's text.
>
>> ---
>>   Documentation/cgroups/memory.txt |    2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index addb1f1..2828164 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -487,6 +487,8 @@ pgpgin            - # of charging events to the memory cgroup. The charging
>>   pgpgout             - # of uncharging events to the memory cgroup. The uncharging
>>               event happens each time a page is unaccounted from the cgroup.
>>   swap                - # of bytes of swap usage
>> +dirty          - # of bytes of file cache that are not in sync with the disk copy.
>> +writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
>>   inactive_anon       - # of bytes of anonymous memory and swap cache memory on
>>               LRU list.
>>   active_anon - # of bytes of anonymous and swap cache memory on active
>>
>
> Documentation/filesystems/proc.txt
>
>        Dirty: Memory which is waiting to get written back to the disk
>    Writeback: Memory which is actively being written back to the disk
>
> even if others are not ;(
>


The wording was actually revised by Fengguang earlier:
https://lkml.org/lkml/2012/7/7/49
It might be more accurate than the previous text, and I just followed his advice...


Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-05  4:48       ` Sha Zhengju
@ 2013-01-06 20:02         ` Hugh Dickins
  -1 siblings, 0 replies; 124+ messages in thread
From: Hugh Dickins @ 2013-01-06 20:02 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm,
	linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Sat, 5 Jan 2013, Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > Maybe I have missed some other locking which would prevent this from
> > happening but the locking relations are really complicated in this area
> > so if mem_cgroup_{begin,end}_update_page_stat might be called
> > recursively then we need a fat comment which justifies that.
> >
> 
> Ohhh...good catching!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic is in the range of locking
> which may be deadlock prone. The problem still exists if we are trying to
> add stat account of other memcg page later, may I make bold to suggest
> that we dig into the lock again...

Forgive me, I must confess I'm no more than skimming this thread,
and don't like dumping unsigned-off patches on people; but thought
that on balance it might be more helpful than not if I offer you a
patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

I too was getting depressed by the constraints imposed by
mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
did to minimize them), and wanted to replace by something freer, more
RCU-like.  In the end it seemed more effort than it was worth to go
as far as I wanted, but I do think that this is some improvement over
what we currently have, and should deal with your recursion issue.

But if this does appear useful to memcg people, then we really ought
to get it checked over by locking/barrier experts before going further.
I think myself that I've over-barriered it, and could use a little
lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
to mind) will see more clearly, and may just hate the whole thing,
as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
I've not wanted to waste their time on reviewing it, if it's not even
going to be useful to memcg people.

It may be easier to understand if you just apply the patch and look
at the result in mm/memcontrol.c, where I tried to gather the pieces
together in one place and describe them ("These functions mediate...").

Hugh

 include/linux/memcontrol.h |   39 +--
 mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
 mm/rmap.c                  |   20 -
 3 files changed, 257 insertions(+), 177 deletions(-)

--- 3.8-rc2/include/linux/memcontrol.h	2012-12-22 09:43:27.172015571 -0800
+++ linux/include/linux/memcontrol.h	2013-01-02 14:47:47.960394878 -0800
@@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
 	return false;
 }
 
-void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
-					 unsigned long *flags);
-
+void __mem_cgroup_begin_update_page_stat(struct page *page);
+void __mem_cgroup_end_update_page_stat(void);
 extern atomic_t memcg_moving;
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						     bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	rcu_read_lock();
-	*locked = false;
-	if (atomic_read(&memcg_moving))
-		__mem_cgroup_begin_update_page_stat(page, locked, flags);
+	preempt_disable();
+	*clamped = false;
+	if (unlikely(atomic_read(&memcg_moving))) {
+		__mem_cgroup_begin_update_page_stat(page);
+		*clamped = true;
+	}
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page,
-				unsigned long *flags);
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						   bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	if (*locked)
-		__mem_cgroup_end_update_page_stat(page, flags);
-	rcu_read_unlock();
+	/* We don't currently use the page arg, but keep it for symmetry */
+	if (unlikely(*clamped))
+		__mem_cgroup_end_update_page_stat();
+	preempt_enable();
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
@@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
 }
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						     bool *clamped)
 {
+	/* It may be helpful to our callers if the stub behaves the same way */
+	preempt_disable();
 }
 
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						   bool *clamped)
 {
+	preempt_enable();
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
--- 3.8-rc2/mm/memcontrol.c	2012-12-22 09:43:27.628015582 -0800
+++ linux/mm/memcontrol.c	2013-01-02 14:55:36.268406008 -0800
@@ -321,12 +321,7 @@ struct mem_cgroup {
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
 	unsigned long 	move_charge_at_immigrate;
-	/*
-	 * set > 0 if pages under this cgroup are moving to other cgroup.
-	 */
-	atomic_t	moving_account;
-	/* taken only while moving_account > 0 */
-	spinlock_t	move_lock;
+
 	/*
 	 * percpu counter.
 	 */
@@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
 }
 
 /*
- * memcg->moving_account is used for checking possibility that some thread is
- * calling move_account(). When a thread on CPU-A starts moving pages under
- * a memcg, other threads should check memcg->moving_account under
- * rcu_read_lock(), like this:
- *
- *         CPU-A                                    CPU-B
- *                                              rcu_read_lock()
- *         memcg->moving_account+1              if (memcg->mocing_account)
- *                                                   take heavy locks.
- *         synchronize_rcu()                    update something.
- *                                              rcu_read_unlock()
- *         start move here.
- */
-
-/* for quick checking without looking up memcg */
-atomic_t memcg_moving __read_mostly;
-
-static void mem_cgroup_start_move(struct mem_cgroup *memcg)
-{
-	atomic_inc(&memcg_moving);
-	atomic_inc(&memcg->moving_account);
-	synchronize_rcu();
-}
-
-static void mem_cgroup_end_move(struct mem_cgroup *memcg)
-{
-	/*
-	 * Now, mem_cgroup_clear_mc() may call this function with NULL.
-	 * We check NULL in callee rather than caller.
-	 */
-	if (memcg) {
-		atomic_dec(&memcg_moving);
-		atomic_dec(&memcg->moving_account);
-	}
-}
-
-/*
- * 2 routines for checking "mem" is under move_account() or not.
- *
- * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
- *			  is used for avoiding races in accounting.  If true,
- *			  pc->mem_cgroup may be overwritten.
- *
  * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
  *			  under hierarchy of moving cgroups. This is for
- *			  waiting at hith-memory prressure caused by "move".
+ *			  waiting at high memory pressure caused by "move".
  */
-
-static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
-{
-	VM_BUG_ON(!rcu_read_lock_held());
-	return atomic_read(&memcg->moving_account) > 0;
-}
-
 static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *from;
@@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
 	return false;
 }
 
-/*
- * Take this lock when
- * - a code tries to modify page's memcg while it's USED.
- * - a code tries to modify page state accounting in a memcg.
- * see mem_cgroup_stolen(), too.
- */
-static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
-				  unsigned long *flags)
-{
-	spin_lock_irqsave(&memcg->move_lock, *flags);
-}
-
-static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
-				unsigned long *flags)
-{
-	spin_unlock_irqrestore(&memcg->move_lock, *flags);
-}
-
 /**
  * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
  * @memcg: The memory cgroup that went over limit
@@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
- *
- * Notes: Race condition
- *
- * We usually use page_cgroup_lock() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
- *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * These functions mediate between the common case of updating memcg stats
+ * when a page transitions from one state to another, and the rare case of
+ * moving a page from one memcg to another.
+ *
+ * A simple example of the updater would be:
+ *	mem_cgroup_begin_update_page_stat(page);
+ *	if (TestClearPageFlag(page))
+ *		mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
+ *	mem_cgroup_end_update_page_stat(page);
+ *
+ * An over-simplified example of the mover would be:
+ *	mem_cgroup_begin_move();
+ *	for each page chosen from old_memcg {
+ *		pc = lookup_page_cgroup(page);
+ *		lock_page_cgroup(pc);
+ *		if (trylock_memcg_move(page)) {
+ *			if (PageFlag(page)) {
+ *				mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
+ *				pc->mem_cgroup = new_memcg;
+ *				mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
+ *			}
+ *			unlock_memcg_move();
+ *			unlock_page_cgroup(pc);
+ *		}
+ *		cond_resched();
+ *	}
+ *	mem_cgroup_end_move();
+ *
+ * Without some kind of serialization between updater and mover, the mover
+ * cannot know whether or not to move one count from old to new memcg stats;
+ * but the serialization must be as lightweight as possible for the updater.
+ *
+ * At present we use two layers of lock avoidance, then spinlock on memcg;
+ * but that already got into (easily avoided) lock hierarchy violation with
+ * the page_cgroup lock; and as dirty writeback stats are added, it gets
+ * into further difficulty with the page cache radix tree lock (and on s390
+ * architecture, page_remove_rmap calls set_page_dirty within its critical
+ * section: perhaps that can be reordered, but if not, it requires nesting).
+ *
+ * We need a mechanism more like rcu_read_lock() for the updater, who then
+ * does not have to worry about lock ordering.  The scheme below is not quite
+ * as light as that: rarely, the updater does have to spin waiting on a mover;
+ * and it is still best for updater to avoid taking page_cgroup lock in its
+ * critical section (though mover drops and retries if necessary, so there is
+ * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
+ */
+
+/*
+ * memcg_moving count is written in advance by movers,
+ * and read by updaters to see if they need to worry further.
+ */
+atomic_t memcg_moving __read_mostly;
+
+/*
+ * Keep it simple: allow only one page to move at a time.  cgroup_mutex
+ * already serializes move_charge_at_immigrate movements, but not writes
+ * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
  *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check mm->moving_account and detect there are possibility of race
- * If there is, we take a lock.
+ * memcg_moving_lock guards writes by movers to memcg_moving_page,
+ * which is read by updaters to see if they need to worry about their page.
+ */
+static DEFINE_SPINLOCK(memcg_moving_lock);
+static struct page *memcg_moving_page;
+
+/*
+ * updating_page_stat is written per-cpu by updaters,
+ * and all cpus read by mover to check when safe to proceed with the move.
  */
+static DEFINE_PER_CPU(int, updating_page_stat) = 0;
 
-void __mem_cgroup_begin_update_page_stat(struct page *page,
-				bool *locked, unsigned long *flags)
+/*
+ * Mover calls mem_cgroup_begin_move() before starting on its pages; its
+ * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
+ */
+static void mem_cgroup_begin_move(void)
 {
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
+	get_online_cpus();
+	atomic_inc(&memcg_moving);
+	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(void)
+{
+	atomic_dec(&memcg_moving);
+	put_online_cpus();
+}
+
+/*
+ * Mover calls trylock_memcg_move(page) before moving stats and changing
+ * ownership of page.  If it fails, mover should drop page_cgroup lock and
+ * any other spinlocks held, cond_resched then try the page again.  This
+ * lets updaters take those locks if unavoidable, though preferably not.
+ */
+static bool trylock_memcg_move(struct page *page)
+{
+	static struct cpumask updating;
+	int try;
+
+	cpumask_copy(&updating, cpu_online_mask);
+	spin_lock(&memcg_moving_lock);
+	memcg_moving_page = page;
 
-	pc = lookup_page_cgroup(page);
-again:
-	memcg = pc->mem_cgroup;
-	if (unlikely(!memcg || !PageCgroupUsed(pc)))
-		return;
 	/*
-	 * If this memory cgroup is not under account moving, we don't
-	 * need to take move_lock_mem_cgroup(). Because we already hold
-	 * rcu_read_lock(), any calls to move_account will be delayed until
-	 * rcu_read_unlock() if mem_cgroup_stolen() == true.
+	 * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
+	 * our memcg_moving_page before it commits to updating_page_stat.
 	 */
-	if (!mem_cgroup_stolen(memcg))
-		return;
+	smp_mb();
 
-	move_lock_mem_cgroup(memcg, flags);
-	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
-		move_unlock_mem_cgroup(memcg, flags);
-		goto again;
+	for (try = 0; try < 64; try++) {
+		int updaters = 0;
+		int cpu;
+
+		for_each_cpu(cpu, &updating) {
+			if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
+				updaters++;
+			else
+				cpumask_clear_cpu(cpu, &updating);
+		}
+		if (!updaters)
+			return true;
 	}
-	*locked = true;
+
+	memcg_moving_page = NULL;
+	spin_unlock(&memcg_moving_lock);
+	return false;
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
+static void unlock_memcg_move(void)
 {
-	struct page_cgroup *pc = lookup_page_cgroup(page);
+	memcg_moving_page = NULL;
+	spin_unlock(&memcg_moving_lock);
+}
 
-	/*
-	 * It's guaranteed that pc->mem_cgroup never changes while
-	 * lock is held because a routine modifies pc->mem_cgroup
-	 * should take move_lock_mem_cgroup().
-	 */
-	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+/*
+ * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
+ * (with preemption disabled) to indicate to the next mover that this cpu is
+ * updating a page, or to wait on the mover if it's already moving this page.
+ */
+void __mem_cgroup_begin_update_page_stat(struct page *page)
+{
+	static const int probing = 0x10000;
+	int updating;
+
+	__this_cpu_add(updating_page_stat, probing);
+
+	for (;;) {
+		/*
+		 * Make sure that trylock_memcg_move(page) can see our
+		 * updating_page_stat before we check memcg_moving_page.
+		 *
+		 * We use the special probing value at first so move sees it,
+		 * but nesting and interrupts on this cpu can distinguish it.
+		 */
+		smp_mb();
+
+		if (likely(page != ACCESS_ONCE(memcg_moving_page)))
+			break;
+
+		/*
+		 * We may be nested, we may be serving an interrupt: do not
+		 * hang here if the outer level already went beyond probing.
+		 */
+		updating = __this_cpu_read(updating_page_stat);
+		if (updating & (probing - 1))
+			break;
+
+		__this_cpu_write(updating_page_stat, 0);
+		while (page == ACCESS_ONCE(memcg_moving_page))
+			cpu_relax();
+		__this_cpu_write(updating_page_stat, updating);
+	}
+
+	/* Add one to count and remove temporary probing value */
+	__this_cpu_sub(updating_page_stat, probing - 1);
+}
+
+void __mem_cgroup_end_update_page_stat(void)
+{
+	__this_cpu_dec(updating_page_stat);
+}
+
+/*
+ * Static inline interfaces to the above in include/linux/memcontrol.h:
+ *
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+						     bool *clamped)
+{
+	preempt_disable();
+	*clamped = false;
+	if (unlikely(atomic_read(&memcg_moving))) {
+		__mem_cgroup_begin_update_page_stat(page);
+		*clamped = true;
+	}
 }
 
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+						   bool *clamped)
+{
+	if (unlikely(*clamped))
+		__mem_cgroup_end_update_page_stat();
+	preempt_enable();
+}
+ */
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
@@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
 		BUG();
 	}
 
-	this_cpu_add(memcg->stat->count[idx], val);
+	/* mem_cgroup_begin_update_page_stat() disabled preemption */
+	__this_cpu_add(memcg->stat->count[idx], val);
 }
 
 /*
@@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
-	unsigned long flags;
 	int ret;
 	bool anon = PageAnon(page);
 
@@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
 		goto unlock;
 
-	move_lock_mem_cgroup(from, &flags);
+	ret = -EAGAIN;
+	if (!trylock_memcg_move(page))
+		goto unlock;
 
 	if (!anon && page_mapped(page)) {
 		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, anon, nr_pages);
-	move_unlock_mem_cgroup(from, &flags);
+	unlock_memcg_move();
 	ret = 0;
 unlock:
 	unlock_page_cgroup(pc);
@@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
 	 */
 	if (!parent)
 		parent = root_mem_cgroup;
-
+retry:
 	if (nr_pages > 1) {
 		VM_BUG_ON(!PageTransHuge(page));
 		flags = compound_lock_irqsave(page);
 	}
 
-	ret = mem_cgroup_move_account(page, nr_pages,
-				pc, child, parent);
-	if (!ret)
-		__mem_cgroup_cancel_local_charge(child, nr_pages);
+	ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
+
+	if (ret == -EAGAIN) {
+		cond_resched();
+		goto retry;
+	}
+
+	if (!ret)
+		__mem_cgroup_cancel_local_charge(child, nr_pages);
+
 	putback_lru_page(page);
 put:
 	put_page(page);
@@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
-		mem_cgroup_start_move(memcg);
+		mem_cgroup_begin_move();
 		for_each_node_state(node, N_MEMORY) {
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				enum lru_list lru;
@@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
 				}
 			}
 		}
-		mem_cgroup_end_move(memcg);
+		mem_cgroup_end_move();
 		memcg_oom_recover(memcg);
 		cond_resched();
 
@@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
 	atomic_set(&memcg->refcnt, 1);
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
-	spin_lock_init(&memcg->move_lock);
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
 	if (error) {
@@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
 	mc.from = NULL;
 	mc.to = NULL;
 	spin_unlock(&mc.lock);
-	mem_cgroup_end_move(from);
+	if (from)
+		mem_cgroup_end_move();
 }
 
 static int mem_cgroup_can_attach(struct cgroup *cgroup,
@@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
-			mem_cgroup_start_move(from);
+			mem_cgroup_begin_move();
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = memcg;
@@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
 {
-	int ret = 0;
+	int ret;
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
 	 *    to be unlocked in __split_huge_page_splitting(), where the main
 	 *    part of thp split is not executed yet.
 	 */
+retry:
+	ret = 0;
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
@@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
 			page = target.page;
 			if (!isolate_lru_page(page)) {
 				pc = lookup_page_cgroup(page);
-				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
-							pc, mc.from, mc.to)) {
+				ret = mem_cgroup_move_account(page,
+					    HPAGE_PMD_NR, pc, mc.from, mc.to);
+				if (!ret) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
 				}
@@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
 			put_page(page);
 		}
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		if (ret == -EAGAIN)
+			goto retry;
 		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
-retry:
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
@@ -6632,8 +6710,9 @@ retry:
 			if (isolate_lru_page(page))
 				goto put;
 			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(page, 1, pc,
-						     mc.from, mc.to)) {
+			ret = mem_cgroup_move_account(page, 1, pc,
+						      mc.from, mc.to);
+			if (!ret) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
@@ -6653,11 +6732,15 @@ put:			/* get_mctgt_type() gets the page
 		default:
 			break;
 		}
+		if (ret == -EAGAIN)
+			break;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 
 	if (addr != end) {
+		if (ret == -EAGAIN)
+			goto retry;
 		/*
 		 * We have consumed all precharges we got in can_attach().
 		 * We try charge one by one, but don't do any additional
--- 3.8-rc2/mm/rmap.c	2012-12-22 09:43:27.656015582 -0800
+++ linux/mm/rmap.c	2013-01-02 15:03:46.100417650 -0800
@@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
  */
 void page_add_file_rmap(struct page *page)
 {
-	bool locked;
-	unsigned long flags;
+	bool clamped;
 
-	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+	mem_cgroup_begin_update_page_stat(page, &clamped);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
-	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	mem_cgroup_end_update_page_stat(page, &clamped);
 }
 
 /**
@@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
-	bool locked;
-	unsigned long flags;
+	bool uninitialized_var(clamped);
 
 	/*
 	 * The anon case has no mem_cgroup page_stat to update; but may
-	 * uncharge_page() below, where the lock ordering can deadlock if
-	 * we hold the lock against page_stat move: so avoid it on anon.
+	 * uncharge_page() below, when holding page_cgroup lock might force
+	 * a page_stat move to back off temporarily: so avoid it on anon.
 	 */
 	if (!anon)
-		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+		mem_cgroup_begin_update_page_stat(page, &clamped);
 
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
@@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		mem_cgroup_end_update_page_stat(page, &clamped);
 	}
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
 	return;
 out:
 	if (!anon)
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		mem_cgroup_end_update_page_stat(page, &clamped);
 }
 
 /*

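To make the updater/mover handshake above easier to experiment with outside the kernel, here is a small standalone model: pthreads and C11 seq_cst atomics stand in for the kernel primitives (smp_mb(), cpu_relax(), the spinlock), and per-thread flags stand in for the per-cpu updating_page_stat counters. It is only a sketch of the idea with made-up names (begin_update, trylock_move and so on), not the patch itself; build with cc -O2 -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUPDATERS 4

static _Atomic(void *) moving_page;         /* models memcg_moving_page */
static atomic_int updating[NUPDATERS];      /* models per-cpu updating_page_stat */
static atomic_long stat_count;              /* the statistic being maintained */

/* Updater: announce ourselves, then back off while our page is being moved.
 * The seq_cst atomics stand in for the explicit smp_mb() pairing above. */
static void begin_update(int self, void *page)
{
	atomic_store(&updating[self], 1);
	while (atomic_load(&moving_page) == page) {
		atomic_store(&updating[self], 0);   /* let the mover finish */
		while (atomic_load(&moving_page) == page)
			;                           /* models the cpu_relax() loop */
		atomic_store(&updating[self], 1);
	}
}

static void end_update(int self)
{
	atomic_store(&updating[self], 0);
}

/* Mover: publish the page, then wait (bounded) for updaters to drain. */
static int trylock_move(void *page)
{
	atomic_store(&moving_page, page);
	for (int try = 0; try < 64; try++) {
		int busy = 0;

		for (int i = 0; i < NUPDATERS; i++)
			busy += atomic_load(&updating[i]);
		if (!busy)
			return 1;
	}
	atomic_store(&moving_page, NULL);           /* give up; caller retries */
	return 0;
}

static void unlock_move(void)
{
	atomic_store(&moving_page, NULL);
}

static void *updater(void *arg)
{
	int self = (int)(long)arg;
	int i;

	for (i = 0; i < 100000; i++) {
		begin_update(self, (void *)0x1000);
		atomic_fetch_add(&stat_count, 1);   /* the "page stat" update */
		end_update(self);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NUPDATERS];
	int moves = 0;
	long i;

	for (i = 0; i < NUPDATERS; i++)
		pthread_create(&tid[i], NULL, updater, (void *)i);

	while (moves < 1000) {
		if (trylock_move((void *)0x1000)) { /* would move ownership here */
			unlock_move();
			moves++;
		}
	}

	for (i = 0; i < NUPDATERS; i++)
		pthread_join(tid[i], NULL);
	printf("updates: %ld, successful moves: %d\n",
	       atomic_load(&stat_count), moves);
	return 0;
}
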
^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
  2012-12-25 17:26   ` Sha Zhengju
@ 2013-01-06 20:07     ` Greg Thelen
  -1 siblings, 0 replies; 124+ messages in thread
From: Greg Thelen @ 2013-01-06 20:07 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, kamezawa.hiroyu,
	fengguang.wu, glommer, Sha Zhengju

On Tue, Dec 25 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
>
> There're two writeback interface to modify: test_clear/set_page_writeback.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   17 +++++++++++++++++
>  3 files changed, 23 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 358019e..1d22b81 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -45,6 +45,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21df36d..13cd14a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -96,6 +96,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -3676,6 +3677,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 526ddd7..ae6498a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2002,11 +2002,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting writeback pages.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2242,7 +2248,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);

Does this violate lock ordering?  Here we are grabbing
mem_cgroup_begin_update_page_stat and below we grab tree_lock.  I
thought the prescribed order was tree_lock first, then
mem_cgroup_begin_update_page_stat.
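
The worry is the classic AB/BA pattern: one path takes the memcg stat "lock" and then tree_lock, the other takes them in the opposite order, and run concurrently the two can deadlock. A minimal standalone illustration with pthread mutexes follows; the lock names are hypothetical stand-ins, not the kernel locks themselves.

#include <pthread.h>

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t memcg_stat_lock = PTHREAD_MUTEX_INITIALIZER;

/* Order used by the writeback hunk above: stat "lock" first, then tree_lock. */
static void writeback_path(void)
{
	pthread_mutex_lock(&memcg_stat_lock);
	pthread_mutex_lock(&tree_lock);
	/* ... test-clear writeback and account ... */
	pthread_mutex_unlock(&tree_lock);
	pthread_mutex_unlock(&memcg_stat_lock);
}

/* Order prescribed for the dirty accounting: tree_lock first, then stat lock. */
static void dirty_path(void)
{
	pthread_mutex_lock(&tree_lock);
	pthread_mutex_lock(&memcg_stat_lock);
	/* ... set dirty and account ... */
	pthread_mutex_unlock(&memcg_stat_lock);
	pthread_mutex_unlock(&tree_lock);
}

static void *hammer(void *arg)
{
	int i;

	for (i = 0; i < 1000000; i++)
		arg ? writeback_path() : dirty_path();
	return NULL;
}

int main(void)	/* running both orders concurrently can hang: AB/BA deadlock */
{
	pthread_t a, b;

	pthread_create(&a, NULL, hammer, (void *)0);
	pthread_create(&b, NULL, hammer, (void *)1);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}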

>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2263,9 +2272,12 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2273,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);

Same "Does this violate lock ordering?" question as above.

>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2300,6 +2315,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2012-12-25 17:26   ` Sha Zhengju
  (?)
@ 2013-01-06 20:07     ` Greg Thelen
  -1 siblings, 0 replies; 124+ messages in thread
From: Greg Thelen @ 2013-01-06 20:07 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, mhocko, akpm,
	kamezawa.hiroyu, fengguang.wu, glommer, dchinner, Sha Zhengju

On Tue, Dec 25 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while cgroup's direct reclaim is working.
>
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:
>
>          mem_cgroup_begin_update_page_stat()
>          modify page information        -->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
> It requires (a) and (b)(dirty pages accounting) can stay close enough.
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
>         incrementing (2):
>                 __set_page_dirty
>                 __set_page_dirty_nobuffers
>         decrementing (2):
>                 clear_page_dirty_for_io
>                 cancel_dirty_page
>
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
> a prepare PageDirty() checking is added.
>
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
> Acked-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  fs/buffer.c                |   14 +++++++++++++-
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |   10 ++++++++++
>  mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
>  mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 84 insertions(+), 15 deletions(-)

__nilfs_clear_page_dirty() clears PageDirty; does it need modification
for this patch series?

> diff --git a/fs/buffer.c b/fs/buffer.c
> index 762168a..53402d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> +	if (PageDirty(page))
>  		return 0;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return 0;
> +	}
> +
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
>  		account_page_dirtied(page, mapping);
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5421b8a..2685d8a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..b589be5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -62,6 +62,11 @@
>   *      ->swap_lock		(exclusive_swap_page, others)
>   *        ->mapping->tree_lock
>   *
> + *    ->private_lock		(__set_page_dirty_buffers)
> + *      ->mapping->tree_lock
> + *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
> + *							move_lock_mem_cgroup)
> + *
>   *  ->i_mutex
>   *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
>   *
> @@ -112,6 +117,8 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> +	bool locked;
> +	unsigned long flags;
>  
>  	/*
>  	 * if we're uptodate, flush out into the cleancache, otherwise
> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>  	 * Fix it up by doing a final dirty accounting check after
>  	 * having removed the page entirely.
>  	 */
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  
>  /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d450c04..c884640 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					unsigned int nr_pages,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_add(from->stat->count[idx], -nr_pages);

What do you think about adding a WARN_ON_ONCE() here to check for
underflow?  A check might help catch:
a) unresolved races between move accounting and the setting/clearing of
   dirty state.
b) future modifications that mess with PageDirty/Writeback flags without
   considering memcg.
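
A throwaway userspace model of the failure mode such a check would catch follows; the names and types are made up, not the memcg code. The second move below stands for a page that was never accounted to the source group, so the source counter goes negative.

#include <stdio.h>

struct group_stat { long dirty; };	/* stand-in for one memcg counter */

/* Move nr_pages of the dirty count from one group to another, and warn
 * once if the source would go negative (the underflow described above). */
static void move_dirty_stat(struct group_stat *from, struct group_stat *to,
			    long nr_pages)
{
	static int warned;

	if (!warned && from->dirty < nr_pages) {
		fprintf(stderr, "dirty stat underflow: %ld - %ld\n",
			from->dirty, nr_pages);
		warned = 1;
	}
	from->dirty -= nr_pages;
	to->dirty += nr_pages;
}

int main(void)
{
	struct group_stat m1 = { .dirty = 1 }, m2 = { .dirty = 0 };

	move_dirty_stat(&m1, &m2, 1);	/* fine: this page was accounted */
	move_dirty_stat(&m1, &m2, 1);	/* e.g. an unaccounted dirty anon page */
	return 0;
}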

> +	__this_cpu_add(to->stat->count[idx], nr_pages);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))

Is (!anon && PageDirty(page)) better?  If a dirty anon page is moved
between memcg M1 and M2, I think we'd mistakenly underflow M1 if it
was not previously accounting for the dirty anon page.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-07  7:25         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  7:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, linux-kernel, cgroups, linux-mm, linux-fsdevel,
	akpm, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju

(2013/01/05 13:48), Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>
>>> This patch adds memcg routines to count dirty pages, which allows memory controller
>>> to maintain an accurate view of the amount of its dirty memory and can provide some
>>> info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>
> Yes, I meant memcg hard/soft reclaim here which is triggered directly
> by allocation and is distinct from background kswapd reclaim (global).
>
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>
> One of the reasons I'm trying to account the dirty pages is to get a
> broader overall view of memory usage, because memcg hard/soft
> reclaim may affect the response time of user applications.
> Yeah, the beneficiary can be application administrator or kernel users.  :P
>
>> [...]
>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>>> and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
>>> a preliminary PageDirty() check is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>    mem_cgroup_begin_update_page_stat             <<< 1
>>    set_page_dirty
>>      __set_page_dirty_buffers
>>        __set_page_dirty
>>          mem_cgroup_begin_update_page_stat       <<< 2
>>            move_lock_mem_cgroup
>>              spin_lock_irqsave(&memcg->move_lock, *flags);
>>
>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>> because we might race with the moving charges:
>>          CPU0                                            CPU1
>> page_remove_rmap
>>                                                  mem_cgroup_can_attach
>>    mem_cgroup_begin_update_page_stat (1)
>>      rcu_read_lock
>>                                                    mem_cgroup_start_move
>>                                                      atomic_inc(&memcg_moving)
>>                                                      atomic_inc(&memcg->moving_account)
>>                                                      synchronize_rcu
>>      __mem_cgroup_begin_update_page_stat
>>        mem_cgroup_stolen <<< TRUE
>>        move_lock_mem_cgroup
>>    [...]
>>          mem_cgroup_begin_update_page_stat (2)
>>            __mem_cgroup_begin_update_page_stat
>>              mem_cgroup_stolen     <<< still TRUE
>>              move_lock_mem_cgroup  <<< DEADLOCK
>>    [...]
>>    mem_cgroup_end_update_page_stat
>>      rcu_unlock
>>                                                    # wake up from synchronize_rcu
>>                                                  [...]
>>                                                  mem_cgroup_move_task
>>                                                    mem_cgroup_move_charge
>>                                                      walk_page_range
>>                                                        mem_cgroup_move_account
>>                                                          move_lock_mem_cgroup
>>
>>
>> Maybe I have missed some other locking which would prevent this from
>> happening but the locking relations are really complicated in this area
>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>> recursively then we need a fat comment which justifies that.
>>
>
> Ohhh...good catching!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic is in the range of locking
> which may be deadlock prone. The problem still exists if we are trying to
> add stat account of other memcg page later, may I make bold to suggest
> that we dig into the lock again...
>
> But with regard to the current lock implementation, I doubt whether we can
> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
> try to get move_lock once at the beginning. IMHO we can make
> mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
> thinking now is changing memcg->move_lock from the original spinlock to
> a rw-spinlock:
> mem_cgroup_{begin,end}_update_page_stat() takes the read lock, which makes it
> reenterable, and the memcg moving task side takes the write lock.
> Then the race may be following:
>
>          CPU0                                            CPU1
> page_remove_rmap
>                                                  mem_cgroup_can_attach
>    mem_cgroup_begin_update_page_stat (1)
>      rcu_read_lock
>                                                    mem_cgroup_start_move
>                                                      atomic_inc(&memcg_moving)
>
> atomic_inc(&memcg->moving_account)
>                                                      synchronize_rcu
>      __mem_cgroup_begin_update_page_stat
>        mem_cgroup_stolen   <<< TRUE
>        move_lock_mem_cgroup   <<<< read-spinlock success
>    [...]
>       mem_cgroup_begin_update_page_stat (2)
>            __mem_cgroup_begin_update_page_stat
>              mem_cgroup_stolen     <<< still TRUE
>              move_lock_mem_cgroup  <<<< read-spinlock success
>
>    [...]
>    mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>      rcu_unlock
>                                                    # wake up from synchronize_rcu
>                                                  [...]
>                                                  mem_cgroup_move_task
>                                                    mem_cgroup_move_charge
>                                                      walk_page_range
>                                                        mem_cgroup_move_account
>
> move_lock_mem_cgroup    <<< write-spinlock
>
>
> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
> Is there anything I lost?
>

An rwlock will work with the nesting, but it seems ugly to do updates under a read-lock.

How about this straightforward approach?
==
/*
 * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
 * the memcg again for nesting calls
 */
static void move_lock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested += 1;
	if (current->memcg_move_lock_nested > 1) {
		VM_BUG_ON(current->move_locked_memcg != memcg);
		return;
	}
	spin_lock_irqsave(&memcg_move_lock, *flags);
	current->move_locked_memcg = memcg;
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, unsigned long *flags)
{
	current->memcg_move_lock_nested -= 1;
	if (!current->memcg_move_lock_nested) {
		current->move_locked_memcg = NULL;
		spin_unlock_irqrestore(&memcg_move_lock, *flags);
	}
}

==
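
For completeness, the sketch above assumes two new per-task fields, roughly
like the following (hypothetical, named after the sketch; they are not part
of the posted series):

	/* in struct task_struct, under CONFIG_MEMCG */
	struct mem_cgroup	*move_locked_memcg;	/* memcg whose move lock is held */
	int			memcg_move_lock_nested;	/* nesting depth of that lock */
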
But, hmm, this kind of (ugly) hack may cause trouble as Hugh said.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-07  7:49           ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  7:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

(2013/01/07 5:02), Hugh Dickins wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>>
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
>
> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
>
In what case does this improve performance ?

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>

Hi, this patch seems interesting, but doesn't this make move_account() very
slow as the number of cpus increases, because of scanning all cpus per page?
And this looks like a reader-can-block-writer percpu rwlock; it's too heavy
for writers if there are many readers.


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-07  7:49           ` Kamezawa Hiroyuki
@ 2013-01-09  5:15             ` Hugh Dickins
  -1 siblings, 0 replies; 124+ messages in thread
From: Hugh Dickins @ 2013-01-09  5:15 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:
> (2013/01/07 5:02), Hugh Dickins wrote:
> > 
> > Forgive me, I must confess I'm no more than skimming this thread,
> > and don't like dumping unsigned-off patches on people; but thought
> > that on balance it might be more helpful than not if I offer you a
> > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
> > 
> > I too was getting depressed by the constraints imposed by
> > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> > did to minimize them), and wanted to replace by something freer, more
> > RCU-like.  In the end it seemed more effort than it was worth to go
> > as far as I wanted, but I do think that this is some improvement over
> > what we currently have, and should deal with your recursion issue.
> > 
> In what case does this improve performance ?

Perhaps none.  I was aiming to not degrade performance at the stats
update end, and make it more flexible, so new stats can be updated which
would be problematic today (for lock ordering and recursion reasons).

I've not done any performance measurement on it, and don't have enough
cpus for an interesting report; but if someone thinks it might solve a
problem for them, and has plenty of cpus to test with, please go ahead,
we'd be glad to hear the results.

> Hi, this patch seems interesting, but doesn't this make move_account() very
> slow as the number of cpus increases, because of scanning all cpus per page?
> And this looks like a reader-can-block-writer percpu rwlock; it's too heavy
> for writers if there are many readers.

I was happy to make the relatively rare move_account end considerably
heavier.  I'll be disappointed if it turns out to be prohibitively
heavy at that end - if we're going to make move_account impossible,
there are much easier ways to achieve that! - but it is a possibility.

Something you might have missed when considering many readers (stats
updaters): the move_account end does not wait for a moment when there
are no readers, that would indeed be a losing strategy; it just waits
for each cpu that's updating page stats to leave that section, so every
cpu is sure to notice and hold off if it then tries to update the page
which is to be moved.  (I may not be explaining that very well!)
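
A rough sketch of the mover-side wait described above (hypothetical and much
simplified; the per-cpu flag and helper names are illustrative only, not taken
from the actual patch): the mover does not wait for a global moment with no
updaters at all, it only waits until each cpu's in-flight update section has
drained once, after which any new updater is sure to notice the move in
progress before it touches the page being moved.

	static DEFINE_PER_CPU(int, memcg_page_stat_updating);

	/* mover side: wait for every cpu's current stat update to finish */
	static void wait_for_page_stat_updaters(void)
	{
		int cpu;

		for_each_online_cpu(cpu)
			while (ACCESS_ONCE(per_cpu(memcg_page_stat_updating, cpu)))
				cpu_relax();
	}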

Hugh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09  7:24               ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-09  7:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

(2013/01/09 14:15), Hugh Dickins wrote:
> On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:
>> (2013/01/07 5:02), Hugh Dickins wrote:
>>>
>>> Forgive me, I must confess I'm no more than skimming this thread,
>>> and don't like dumping unsigned-off patches on people; but thought
>>> that on balance it might be more helpful than not if I offer you a
>>> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
>>>
>>> I too was getting depressed by the constraints imposed by
>>> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
>>> did to minimize them), and wanted to replace by something freer, more
>>> RCU-like.  In the end it seemed more effort than it was worth to go
>>> as far as I wanted, but I do think that this is some improvement over
>>> what we currently have, and should deal with your recursion issue.
>>>
>> In what case does this improve performance ?
>
> Perhaps none.  I was aiming to not degrade performance at the stats
> update end, and make it more flexible, so new stats can be updated which
> would be problematic today (for lock ordering and recursion reasons).
>
> I've not done any performance measurement on it, and don't have enough
> cpus for an interesting report; but if someone thinks it might solve a
> problem for them, and has plenty of cpus to test with, please go ahead,
> we'd be glad to hear the results.
>
>> Hi, this patch seems interesting, but doesn't this make move_account() very
>> slow as the number of cpus increases, because of scanning all cpus per page?
>> And this looks like a reader-can-block-writer percpu rwlock; it's too heavy
>> for writers if there are many readers.
>
> I was happy to make the relatively rare move_account end considerably
> heavier.  I'll be disappointed if it turns out to be prohibitively
> heavy at that end - if we're going to make move_account impossible,
> there are much easier ways to achieve that! - but it is a possibility.
>

move_account at task-move has been a required feature for NEC, and Nishimura-san
did a good job on it. I'd like to keep it available as much as possible.

> Something you might have missed when considering many readers (stats
> updaters): the move_account end does not wait for a moment when there
> are no readers, that would indeed be a losing strategy; it just waits
> for each cpu that's updating page stats to leave that section, so every
> cpu is sure to notice and hold off if it then tries to update the page
> which is to be moved.  (I may not be explaining that very well!)
>

Hmm, yeah, maybe I am missing something.

BTW, when nesting, mem_cgroup_end_update_page_stat() seems to make the counter go negative.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 5/8] memcg: add per cgroup writeback pages accounting
  2013-01-06 20:07     ` Greg Thelen
@ 2013-01-09  9:08       ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09  9:08 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, cgroups, linux-mm, mhocko, akpm, kamezawa.hiroyu,
	fengguang.wu, glommer, Sha Zhengju

On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen <gthelen@google.com> wrote:
> On Tue, Dec 25 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
>> rule still is:
>>         mem_cgroup_begin_update_page_stat()
>>         modify page WRITEBACK stat
>>         mem_cgroup_update_page_stat()
>>         mem_cgroup_end_update_page_stat()
>>
>> There're two writeback interface to modify: test_clear/set_page_writeback.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>  include/linux/memcontrol.h |    1 +
>>  mm/memcontrol.c            |    5 +++++
>>  mm/page-writeback.c        |   17 +++++++++++++++++
>>  3 files changed, 23 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 358019e..1d22b81 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -45,6 +45,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>>       MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>> +     MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 21df36d..13cd14a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -96,6 +96,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "mapped_file",
>>       "swap",
>>       "dirty",
>> +     "writeback",
>>  };
>>
>>  enum mem_cgroup_events_index {
>> @@ -3676,6 +3677,10 @@ static int mem_cgroup_move_account(struct page *page,
>>               mem_cgroup_move_account_page_stat(from, to, nr_pages,
>>                       MEM_CGROUP_STAT_FILE_DIRTY);
>>
>> +     if (PageWriteback(page))
>> +             mem_cgroup_move_account_page_stat(from, to, nr_pages,
>> +                     MEM_CGROUP_STAT_WRITEBACK);
>> +
>>       mem_cgroup_charge_statistics(from, anon, -nr_pages);
>>
>>       /* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 526ddd7..ae6498a 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2002,11 +2002,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>>   * Helper function for set_page_writeback family.
>> + *
>> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
>> + * while modifying struct page state and accounting writeback pages.
>> + * See test_set_page_writeback for example.
>> + *
>>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>>   * wrt interrupts.
>>   */
>>  void account_page_writeback(struct page *page)
>>  {
>> +     mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>       inc_zone_page_state(page, NR_WRITEBACK);
>>  }
>>  EXPORT_SYMBOL(account_page_writeback);
>> @@ -2242,7 +2248,10 @@ int test_clear_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long memcg_flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>
> Does this violate lock ordering?  Here we are grabbing
> mem_cgroup_begin_update_page_stat and below we grab tree_lock.  I
> thought the prescribed order was tree_lock first, then
> mem_cgroup_begin_update_page_stat.

Yes, tree_lock should go first. Sorry, it's my fault for forgetting to
check this one.
Thanks for the reminder!
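
Just to illustrate the reordered nesting, here is a minimal sketch
that skips the bdi stat handling, the radix-tree tag clearing and the
!mapping case (so it is not the actual next-version hunk):

        spin_lock_irqsave(&mapping->tree_lock, flags);
        /* memcg page-stat section nests inside tree_lock, as in __set_page_dirty() */
        mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);

        ret = TestClearPageWriteback(page);
        if (ret) {
                mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                dec_zone_page_state(page, NR_WRITEBACK);
                inc_zone_page_state(page, NR_WRITTEN);
        }

        mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
        spin_unlock_irqrestore(&mapping->tree_lock, flags);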

>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2263,9 +2272,12 @@ int test_clear_page_writeback(struct page *page)
>>               ret = TestClearPageWriteback(page);
>>       }
>>       if (ret) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>               dec_zone_page_state(page, NR_WRITEBACK);
>>               inc_zone_page_state(page, NR_WRITTEN);
>>       }
>> +
>> +     mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>>       return ret;
>>  }
>>
>> @@ -2273,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>
> Same "Does this violate lock ordering?" question as above.

OK, got it.
>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2300,6 +2315,8 @@ int test_set_page_writeback(struct page *page)
>>       }
>>       if (!ret)
>>               account_page_writeback(page);
>> +
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       return ret;
>>
>>  }



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09  9:45       ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09  9:45 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, mhocko, akpm,
	kamezawa.hiroyu, fengguang.wu, glommer, dchinner, Sha Zhengju

On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen <gthelen@google.com> wrote:
> On Tue, Dec 25 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory and can provide some
>> info for users while cgroup's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
>> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
>> has a feature to move a page from a cgroup to another one and may have race between
>> "move" and "page stat accounting". So in order to avoid the race we have designed a
>> bigger lock:
>>
>>          mem_cgroup_begin_update_page_stat()
>>          modify page information        -->(a)
>>          mem_cgroup_update_page_stat()  -->(b)
>>          mem_cgroup_end_update_page_stat()
>> It requires (a) and (b)(dirty pages accounting) can stay close enough.
>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>> and now the interfaces are more explicit:
>>         incrementing (2):
>>                 __set_page_dirty
>>                 __set_page_dirty_nobuffers
>>         decrementing (2):
>>                 clear_page_dirty_for_io
>>                 cancel_dirty_page
>>
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>> a prepare PageDirty() checking is added.
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
>> Acked-by: Fengguang Wu <fengguang.wu@intel.com>
>> ---
>>  fs/buffer.c                |   14 +++++++++++++-
>>  include/linux/memcontrol.h |    1 +
>>  mm/filemap.c               |   10 ++++++++++
>>  mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
>>  mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
>>  mm/truncate.c              |    6 ++++++
>>  6 files changed, 84 insertions(+), 15 deletions(-)
>
> __nilfs_clear_page_dirty() clears PageDirty, does it need modification
> for this patch series?

It doesn't need to.
mem_cgroup_dec/inc_page_stat() accompanies dec/inc_zone_page_state()
to account the memcg page stats.  IMHO we only need to modify the
places where SetPageDirty and dec/inc_zone_page_state() occur
together.
__nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page)
later, which is where the accounting is done.
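
For reference, a minimal sketch of the pairing on the
clear_page_dirty_for_io() side, where the memcg counter moves together
with the zone counter inside the begin/end section (only an
illustration of patch 4/8's idea, not the exact hunk):

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (TestClearPageDirty(page)) {
                mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
                dec_zone_page_state(page, NR_FILE_DIRTY);
                dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
        }
        mem_cgroup_end_update_page_stat(page, &locked, &flags);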

>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 762168a..53402d2 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>  int __set_page_dirty(struct page *page,
>>               struct address_space *mapping, int warn)
>>  {
>> +     bool locked;
>> +     unsigned long flags;
>> +
>>       if (unlikely(!mapping))
>>               return !TestSetPageDirty(page);
>>
>> -     if (TestSetPageDirty(page))
>> +     if (PageDirty(page))
>>               return 0;
>>
>>       spin_lock_irq(&mapping->tree_lock);
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
>> +     if (TestSetPageDirty(page)) {
>> +             mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +             spin_unlock_irq(&mapping->tree_lock);
>> +             return 0;
>> +     }
>> +
>>       if (page->mapping) {    /* Race with truncate? */
>>               WARN_ON_ONCE(warn && !PageUptodate(page));
>>               account_page_dirtied(page, mapping);
>>               radix_tree_tag_set(&mapping->page_tree,
>>                               page_index(page), PAGECACHE_TAG_DIRTY);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       spin_unlock_irq(&mapping->tree_lock);
>>       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 5421b8a..2685d8a 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
>>       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> +     MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 83efee7..b589be5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -62,6 +62,11 @@
>>   *      ->swap_lock          (exclusive_swap_page, others)
>>   *        ->mapping->tree_lock
>>   *
>> + *    ->private_lock         (__set_page_dirty_buffers)
>> + *      ->mapping->tree_lock
>> + *        ->memcg->move_lock (mem_cgroup_begin_update_page_stat->
>> + *                                                   move_lock_mem_cgroup)
>> + *
>>   *  ->i_mutex
>>   *    ->i_mmap_mutex         (truncate->unmap_mapping_range)
>>   *
>> @@ -112,6 +117,8 @@
>>  void __delete_from_page_cache(struct page *page)
>>  {
>>       struct address_space *mapping = page->mapping;
>> +     bool locked;
>> +     unsigned long flags;
>>
>>       /*
>>        * if we're uptodate, flush out into the cleancache, otherwise
>> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>>        * Fix it up by doing a final dirty accounting check after
>>        * having removed the page entirely.
>>        */
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>               dec_zone_page_state(page, NR_FILE_DIRTY);
>>               dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>  }
>>
>>  /**
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d450c04..c884640 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "rss",
>>       "mapped_file",
>>       "swap",
>> +     "dirty",
>>  };
>>
>>  enum mem_cgroup_events_index {
>> @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +                                     struct mem_cgroup *to,
>> +                                     unsigned int nr_pages,
>> +                                     enum mem_cgroup_stat_index idx)
>> +{
>> +     /* Update stat data for mem_cgroup */
>> +     preempt_disable();
>> +     __this_cpu_add(from->stat->count[idx], -nr_pages);
>
> What you do think about adding a WARN_ON_ONCE() here to check for
> underflow?  A check might help catch:
> a) unresolved races between move accounting vs setting/clearing
>    dirtying.
> b) future modifications that mess with PageDirty/Writeback flags without
>    considering memcg.
To avoid the current memcg deadlock and lock nesting, I'm thinking
about another synchronization proposal for memcg page stat &
move_account. There the counter may go negative for a very short
period. I'm not sure yet whether that's acceptable... maybe I'll send
it out in another thread later...
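
For illustration only, the suggested check would sit roughly like
this; since the per-cpu deltas are only meaningful when summed over
all cpus, a per-cpu WARN_ON_ONCE() could trip spuriously, which is
part of the worry above:

        /* hypothetical underflow check, not in the posted patch */
        preempt_disable();
        __this_cpu_add(from->stat->count[idx], -nr_pages);
        WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < 0);
        __this_cpu_add(to->stat->count[idx], nr_pages);
        preempt_enable();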

>
>> +     __this_cpu_add(to->stat->count[idx], nr_pages);
>> +     preempt_enable();
>> +}
>> +
>>  /**
>>   * mem_cgroup_move_account - move account of the page
>>   * @page: the page
>> @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
>>
>>       move_lock_mem_cgroup(from, &flags);
>>
>> -     if (!anon && page_mapped(page)) {
>> -             /* Update mapped_file data for mem_cgroup */
>> -             preempt_disable();
>> -             __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -             __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -             preempt_enable();
>> -     }
>> +     if (!anon && page_mapped(page))
>> +             mem_cgroup_move_account_page_stat(from, to, nr_pages,
>> +                     MEM_CGROUP_STAT_FILE_MAPPED);
>> +
>> +     if (PageDirty(page))
>
> Is (!anon && PageDirty(page)) better?  If dirty anon pages are moved
> between memcg M1 and M2 I think that we'd mistakenly underflow M1 if it
> was not previously accounting for the dirty anon page.
Yeah... A page can be PageAnon and PageDirty simultaneously, but we only
account dirty file pages. Will update this in the next version.
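
That is, the hunk would presumably become something like:

        if (!anon && PageDirty(page))
                mem_cgroup_move_account_page_stat(from, to, nr_pages,
                                MEM_CGROUP_STAT_FILE_DIRTY);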


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics
  2013-01-05 10:52       ` Sha Zhengju
@ 2013-01-09 12:57         ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-09 12:57 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

On Sat 05-01-13 18:52:12, Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 8:27 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 26-12-12 01:27:27, Sha Zhengju wrote:
[...]
> >> @@ -5396,18 +5406,70 @@ static inline void mem_cgroup_lru_names_not_uptodate(void)
> >>       BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
> >>  }
> >>
> >> +long long root_memcg_local_stat(unsigned int i, long long val,
> >> +                                     long long nstat[])
> >
> > Function should be static
> > also
> > nstat parameter is ugly because this can be done by the caller
> > and also expecting that the caller already calculated val is not
> > nice (and undocumented). This approach is really hackish and error
> > prone. Why should we define a specific function rather than hooking into
> > mem_cgroup_read_stat and doing all the stuff there? I think that would
> > be much more maintainable.
> >
> 
> IMHO, hooking into mem_cgroup_read_stat may be also improper because
> of the for_each_mem_cgroup traversal. I prefer to make mem_cgroup_read_stat
> as the base func unit. But I'll repeal the function base on your opinion in next
> version.  Thanks for the advice!

Maybe my "do all the stuff there" was confusing. I didn't mean to
iterate through the hierarchy there. I just wanted mem_cgroup_root to
be a special case which uses the global counters there.
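
Something along these lines, only a sketch of where the special case
would live (the real function also has to fold in the hotplug base
counts, and the mapping from stat index to global counter is just an
example here):

static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
                                 enum mem_cgroup_stat_index idx)
{
        long val = 0;
        int cpu;

        /* root is special: report the global VM counter instead */
        if (mem_cgroup_is_root(memcg) && idx == MEM_CGROUP_STAT_FILE_DIRTY)
                return global_page_state(NR_FILE_DIRTY);

        for_each_online_cpu(cpu)
                val += per_cpu(memcg->stat->count[idx], cpu);
        return val;
}
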
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09 14:35           ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09 14:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm,
	linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins <hughd@google.com> wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> > Maybe I have missed some other locking which would prevent this from
>> > happening but the locking relations are really complicated in this area
>> > so if mem_cgroup_{begin,end}_update_page_stat might be called
>> > recursively then we need a fat comment which justifies that.
>> >
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
It took me some time to understand the patch. Yeah, it can solve my
recursion issue and also removes some locks (e.g. move_lock). But it
does have a side effect on the move end, which becomes slower. To my
knowledge, each task is forked in the root memcg and is moved when it
is attached to a cgroup, so move_account is also a fairly frequent
operation.
Some comments are below.

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>
>  include/linux/memcontrol.h |   39 +--
>  mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
>  mm/rmap.c                  |   20 -
>  3 files changed, 257 insertions(+), 177 deletions(-)
>
> --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
> +++ linux/include/linux/memcontrol.h    2013-01-02 14:47:47.960394878 -0800
> @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
>         return false;
>  }
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
> -                                        unsigned long *flags);
> -
> +void __mem_cgroup_begin_update_page_stat(struct page *page);
> +void __mem_cgroup_end_update_page_stat(void);
>  extern atomic_t memcg_moving;
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       rcu_read_lock();
> -       *locked = false;
> -       if (atomic_read(&memcg_moving))
> -               __mem_cgroup_begin_update_page_stat(page, locked, flags);
> +       preempt_disable();

Referring to the synchronize_rcu() in mem_cgroup_begin_move(): is an
rcu_read_lock() missing here?
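
Purely to illustrate the pairing in question (not a claim about the
intended design), something like:

static inline void mem_cgroup_begin_update_page_stat(struct page *page,
                                                     bool *clamped)
{
        rcu_read_lock();  /* so synchronize_rcu() in mem_cgroup_begin_move() waits for us? */
        preempt_disable();
        *clamped = false;
        if (unlikely(atomic_read(&memcg_moving))) {
                __mem_cgroup_begin_update_page_stat(page);
                *clamped = true;
        }
}

with the matching rcu_read_unlock() in mem_cgroup_end_update_page_stat().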

> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page,
> -                               unsigned long *flags);
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       if (*locked)
> -               __mem_cgroup_end_update_page_stat(page, flags);
> -       rcu_read_unlock();
> +       /* We don't currently use the page arg, but keep it for symmetry */
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();

Ditto. Is an rcu_read_unlock() needed here?

> +       preempt_enable();
>  }
>
>  void mem_cgroup_update_page_stat(struct page *page,
> @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
>  }
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> +       /* It may be helpful to our callers if the stub behaves the same way */
> +       preempt_disable();
>  }
>
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> +       preempt_enable();
>  }
>
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> --- 3.8-rc2/mm/memcontrol.c     2012-12-22 09:43:27.628015582 -0800
> +++ linux/mm/memcontrol.c       2013-01-02 14:55:36.268406008 -0800
> @@ -321,12 +321,7 @@ struct mem_cgroup {
>          * mem_cgroup ? And what type of charges should we move ?
>          */
>         unsigned long   move_charge_at_immigrate;
> -       /*
> -        * set > 0 if pages under this cgroup are moving to other cgroup.
> -        */
> -       atomic_t        moving_account;
> -       /* taken only while moving_account > 0 */
> -       spinlock_t      move_lock;
> +
>         /*
>          * percpu counter.
>          */
> @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
>  }
>
>  /*
> - * memcg->moving_account is used for checking possibility that some thread is
> - * calling move_account(). When a thread on CPU-A starts moving pages under
> - * a memcg, other threads should check memcg->moving_account under
> - * rcu_read_lock(), like this:
> - *
> - *         CPU-A                                    CPU-B
> - *                                              rcu_read_lock()
> - *         memcg->moving_account+1              if (memcg->mocing_account)
> - *                                                   take heavy locks.
> - *         synchronize_rcu()                    update something.
> - *                                              rcu_read_unlock()
> - *         start move here.
> - */
> -
> -/* for quick checking without looking up memcg */
> -atomic_t memcg_moving __read_mostly;
> -
> -static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> -{
> -       atomic_inc(&memcg_moving);
> -       atomic_inc(&memcg->moving_account);
> -       synchronize_rcu();
> -}
> -
> -static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> -{
> -       /*
> -        * Now, mem_cgroup_clear_mc() may call this function with NULL.
> -        * We check NULL in callee rather than caller.
> -        */
> -       if (memcg) {
> -               atomic_dec(&memcg_moving);
> -               atomic_dec(&memcg->moving_account);
> -       }
> -}
> -
> -/*
> - * 2 routines for checking "mem" is under move_account() or not.
> - *
> - * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
> - *                       is used for avoiding races in accounting.  If true,
> - *                       pc->mem_cgroup may be overwritten.
> - *
>   * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
>   *                       under hierarchy of moving cgroups. This is for
> - *                       waiting at hith-memory prressure caused by "move".
> + *                       waiting at high memory pressure caused by "move".
>   */
> -
> -static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
> -{
> -       VM_BUG_ON(!rcu_read_lock_held());
> -       return atomic_read(&memcg->moving_account) > 0;
> -}
> -
>  static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
>  {
>         struct mem_cgroup *from;
> @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
>         return false;
>  }
>
> -/*
> - * Take this lock when
> - * - a code tries to modify page's memcg while it's USED.
> - * - a code tries to modify page state accounting in a memcg.
> - * see mem_cgroup_stolen(), too.
> - */
> -static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
> -                                 unsigned long *flags)
> -{
> -       spin_lock_irqsave(&memcg->move_lock, *flags);
> -}
> -
> -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> -                               unsigned long *flags)
> -{
> -       spin_unlock_irqrestore(&memcg->move_lock, *flags);
> -}
> -
>  /**
>   * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>   * @memcg: The memory cgroup that went over limit
> @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> - *
> - * Notes: Race condition
> - *
> - * We usually use page_cgroup_lock() for accessing page_cgroup member but
> - * it tends to be costly. But considering some conditions, we doesn't need
> - * to do so _always_.
> - *
> - * Considering "charge", lock_page_cgroup() is not required because all
> - * file-stat operations happen after a page is attached to radix-tree. There
> - * are no race with "charge".
> - *
> - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> - * if there are race with "uncharge". Statistics itself is properly handled
> - * by flags.
> + * These functions mediate between the common case of updating memcg stats
> + * when a page transitions from one state to another, and the rare case of
> + * moving a page from one memcg to another.
> + *
> + * A simple example of the updater would be:
> + *     mem_cgroup_begin_update_page_stat(page);
> + *     if (TestClearPageFlag(page))
> + *             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *     mem_cgroup_end_update_page_stat(page);
> + *
> + * An over-simplified example of the mover would be:
> + *     mem_cgroup_begin_move();
> + *     for each page chosen from old_memcg {
> + *             pc = lookup_page_cgroup(page);
> + *             lock_page_cgroup(pc);
> + *             if (trylock_memcg_move(page)) {
> + *                     if (PageFlag(page)) {
> + *                             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *                             pc->mem_cgroup = new_memcg;
> + *                             mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
> + *                     }
> + *                     unlock_memcg_move();
> + *                     unlock_page_cgroup(pc);
> + *             }
> + *             cond_resched();
> + *     }
> + *     mem_cgroup_end_move();
> + *
> + * Without some kind of serialization between updater and mover, the mover
> + * cannot know whether or not to move one count from old to new memcg stats;
> + * but the serialization must be as lightweight as possible for the updater.
> + *
> + * At present we use two layers of lock avoidance, then spinlock on memcg;
> + * but that already got into (easily avoided) lock hierarchy violation with
> + * the page_cgroup lock; and as dirty writeback stats are added, it gets
> + * into further difficulty with the page cache radix tree lock (and on s390
> + * architecture, page_remove_rmap calls set_page_dirty within its critical
> + * section: perhaps that can be reordered, but if not, it requires nesting).
> + *
> + * We need a mechanism more like rcu_read_lock() for the updater, who then
> + * does not have to worry about lock ordering.  The scheme below is not quite
> + * as light as that: rarely, the updater does have to spin waiting on a mover;
> + * and it is still best for updater to avoid taking page_cgroup lock in its
> + * critical section (though mover drops and retries if necessary, so there is
> + * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
> + */
> +
> +/*
> + * memcg_moving count is written in advance by movers,
> + * and read by updaters to see if they need to worry further.
> + */
> +atomic_t memcg_moving __read_mostly;
> +
> +/*
> + * Keep it simple: allow only one page to move at a time.  cgroup_mutex
> + * already serializes move_charge_at_immigrate movements, but not writes
> + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
>   *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check mm->moving_account and detect there are possibility of race
> - * If there is, we take a lock.
> + * memcg_moving_lock guards writes by movers to memcg_moving_page,
> + * which is read by updaters to see if they need to worry about their page.
> + */
> +static DEFINE_SPINLOCK(memcg_moving_lock);
> +static struct page *memcg_moving_page;
> +
> +/*
> + * updating_page_stat is written per-cpu by updaters,
> + * and all cpus read by mover to check when safe to proceed with the move.
>   */
> +static DEFINE_PER_CPU(int, updating_page_stat) = 0;
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page,
> -                               bool *locked, unsigned long *flags)
> +/*
> + * Mover calls mem_cgroup_begin_move() before starting on its pages; its
> + * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
> + */
> +static void mem_cgroup_begin_move(void)
>  {
> -       struct mem_cgroup *memcg;
> -       struct page_cgroup *pc;
> +       get_online_cpus();
> +       atomic_inc(&memcg_moving);
> +       synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(void)
> +{
> +       atomic_dec(&memcg_moving);
> +       put_online_cpus();
> +}
> +
> +/*
> + * Mover calls trylock_memcg_move(page) before moving stats and changing
> + * ownership of page.  If it fails, mover should drop page_cgroup lock and
> + * any other spinlocks held, cond_resched then try the page again.  This
> + * lets updaters take those locks if unavoidable, though preferably not.
> + */
> +static bool trylock_memcg_move(struct page *page)
> +{
> +       static struct cpumask updating;
> +       int try;
> +
> +       cpumask_copy(&updating, cpu_online_mask);
> +       spin_lock(&memcg_moving_lock);
By the time we reach here, we have already taken lock_page_cgroup(),
which is a much heavier lock. But memcg_moving_lock is a single global
spinlock that every memcg doing move_account will try to get, so many
movers may end up spinning here while still holding their page_cgroup
lock. This may aggravate the performance decrease IMHO.
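
Just to illustrate the point, here is a rough sketch I made up (the
_nospin name is invented, this is not code from the patch): the mover
could try-lock the global lock and fall back to the existing
-EAGAIN/cond_resched()/retry path instead of spinning on
memcg_moving_lock while still holding page_cgroup lock:

static bool trylock_memcg_move_nospin(struct page *page)
{
        static struct cpumask updating;
        int try;

        /* Do not spin under page_cgroup lock: let the caller back off. */
        if (!spin_trylock(&memcg_moving_lock))
                return false;

        cpumask_copy(&updating, cpu_online_mask);
        memcg_moving_page = page;

        /* As in trylock_memcg_move(): publish our page to the updaters. */
        smp_mb();

        for (try = 0; try < 64; try++) {
                int updaters = 0;
                int cpu;

                for_each_cpu(cpu, &updating) {
                        if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
                                updaters++;
                        else
                                cpumask_clear_cpu(cpu, &updating);
                }
                if (!updaters)
                        return true;
        }

        memcg_moving_page = NULL;
        spin_unlock(&memcg_moving_lock);
        return false;
}

The callers already handle -EAGAIN by unlocking page_cgroup,
cond_resched()ing and retrying, so I think nothing else would need to
change; a mover that loses the race simply retries the page.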

> +       memcg_moving_page = page;
>
> -       pc = lookup_page_cgroup(page);
> -again:
> -       memcg = pc->mem_cgroup;
> -       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> -               return;
>         /*
> -        * If this memory cgroup is not under account moving, we don't
> -        * need to take move_lock_mem_cgroup(). Because we already hold
> -        * rcu_read_lock(), any calls to move_account will be delayed until
> -        * rcu_read_unlock() if mem_cgroup_stolen() == true.
> +        * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
> +        * our memcg_moving_page before it commits to updating_page_stat.
>          */
> -       if (!mem_cgroup_stolen(memcg))
> -               return;
> +       smp_mb();
>
> -       move_lock_mem_cgroup(memcg, flags);
> -       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> -               move_unlock_mem_cgroup(memcg, flags);
> -               goto again;
> +       for (try = 0; try < 64; try++) {
> +               int updaters = 0;
> +               int cpu;
> +
> +               for_each_cpu(cpu, &updating) {
> +                       if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
> +                               updaters++;
> +                       else
> +                               cpumask_clear_cpu(cpu, &updating);
> +               }
> +               if (!updaters)
> +                       return true;
>         }
> -       *locked = true;
> +
> +       memcg_moving_page = NULL;
Is an smp_mb() also needed here?
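
What I have in mind is something like the placement below (only a
sketch of where it would go; whether it is really required is exactly
my question):

        memcg_moving_page = NULL;
        /*
         * Mirror of the smp_mb() that follows "memcg_moving_page = page"
         * above: is it needed here too, before dropping the lock?
         */
        smp_mb();
        spin_unlock(&memcg_moving_lock);
        return false;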

> +       spin_unlock(&memcg_moving_lock);
> +       return false;
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> +static void unlock_memcg_move(void)
>  {
> -       struct page_cgroup *pc = lookup_page_cgroup(page);
> +       memcg_moving_page = NULL;
> +       spin_unlock(&memcg_moving_lock);
> +}
>
> -       /*
> -        * It's guaranteed that pc->mem_cgroup never changes while
> -        * lock is held because a routine modifies pc->mem_cgroup
> -        * should take move_lock_mem_cgroup().
> -        */
> -       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +/*
> + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
> + * (with preemption disabled) to indicate to the next mover that this cpu is
> + * updating a page, or to wait on the mover if it's already moving this page.
> + */
> +void __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> +       static const int probing = 0x10000;
> +       int updating;
> +
> +       __this_cpu_add(updating_page_stat, probing);
> +
> +       for (;;) {
> +               /*
> +                * Make sure that trylock_memcg_move(page) can see our
> +                * updating_page_stat before we check memcg_moving_page.
> +                *
> +                * We use the special probing value at first so move sees it,
> +                * but nesting and interrupts on this cpu can distinguish it.
> +                */
> +               smp_mb();
> +
> +               if (likely(page != ACCESS_ONCE(memcg_moving_page)))
> +                       break;
> +
> +               /*
> +                * We may be nested, we may be serving an interrupt: do not
> +                * hang here if the outer level already went beyond probing.
> +                */
> +               updating = __this_cpu_read(updating_page_stat);
> +               if (updating & (probing - 1))
> +                       break;
> +
> +               __this_cpu_write(updating_page_stat, 0);
> +               while (page == ACCESS_ONCE(memcg_moving_page))
> +                       cpu_relax();
> +               __this_cpu_write(updating_page_stat, updating);
> +       }
> +
> +       /* Add one to count and remove temporary probing value */
> +       __this_cpu_sub(updating_page_stat, probing - 1);
> +}
> +
> +void __mem_cgroup_end_update_page_stat(void)
> +{
> +       __this_cpu_dec(updating_page_stat);
> +}
> +
> +/*
> + * Static inline interfaces to the above in include/linux/memcontrol.h:
> + *
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                                    bool *clamped)
> +{
> +       preempt_disable();
> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> +static inline void mem_cgroup_end_update_page_stat(struct page *page,
> +                                                  bool *clamped)
> +{
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();
> +       preempt_enable();
> +}
> + */
> +
>  void mem_cgroup_update_page_stat(struct page *page,
>                                  enum mem_cgroup_page_stat_item idx, int val)
>  {
>         struct mem_cgroup *memcg;
>         struct page_cgroup *pc = lookup_page_cgroup(page);
> -       unsigned long uninitialized_var(flags);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
>                 BUG();
>         }
>
> -       this_cpu_add(memcg->stat->count[idx], val);
> +       /* mem_cgroup_begin_update_page_stat() disabled preemption */
> +       __this_cpu_add(memcg->stat->count[idx], val);
>  }
>
>  /*
> @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
>                                    struct mem_cgroup *from,
>                                    struct mem_cgroup *to)
>  {
> -       unsigned long flags;
>         int ret;
>         bool anon = PageAnon(page);
>
> @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
>         if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
>                 goto unlock;
>
> -       move_lock_mem_cgroup(from, &flags);
> +       ret = -EAGAIN;
> +       if (!trylock_memcg_move(page))
> +               goto unlock;
>
>         if (!anon && page_mapped(page)) {
>                 /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
>         }
>         mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
>         /* caller should have done css_get */
>         pc->mem_cgroup = to;
>         mem_cgroup_charge_statistics(to, anon, nr_pages);
> -       move_unlock_mem_cgroup(from, &flags);
> +       unlock_memcg_move();
>         ret = 0;
>  unlock:
>         unlock_page_cgroup(pc);
> @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
>          */
>         if (!parent)
>                 parent = root_mem_cgroup;
> -
> +retry:
>         if (nr_pages > 1) {
>                 VM_BUG_ON(!PageTransHuge(page));
>                 flags = compound_lock_irqsave(page);
>         }
>
> -       ret = mem_cgroup_move_account(page, nr_pages,
> -                               pc, child, parent);
> -       if (!ret)
> -               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +       ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
>
>         if (nr_pages > 1)
>                 compound_unlock_irqrestore(page, flags);
> +
> +       if (ret == -EAGAIN) {
> +               cond_resched();
> +               goto retry;
> +       }
> +
> +       if (!ret)
> +               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +
>         putback_lru_page(page);
>  put:
>         put_page(page);
> @@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
>                 /* This is for making all *used* pages to be on LRU. */
>                 lru_add_drain_all();
>                 drain_all_stock_sync(memcg);
> -               mem_cgroup_start_move(memcg);
> +               mem_cgroup_begin_move();
>                 for_each_node_state(node, N_MEMORY) {
>                         for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>                                 enum lru_list lru;
> @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
>                                 }
>                         }
>                 }
> -               mem_cgroup_end_move(memcg);
> +               mem_cgroup_end_move();
>                 memcg_oom_recover(memcg);
>                 cond_resched();
>
> @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
>         atomic_set(&memcg->refcnt, 1);
>         memcg->move_charge_at_immigrate = 0;
>         mutex_init(&memcg->thresholds_lock);
> -       spin_lock_init(&memcg->move_lock);
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>         if (error) {
> @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
>         mc.from = NULL;
>         mc.to = NULL;
>         spin_unlock(&mc.lock);
> -       mem_cgroup_end_move(from);
> +       if (from)
> +               mem_cgroup_end_move();
>  }
>
>  static int mem_cgroup_can_attach(struct cgroup *cgroup,
> @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
>                         VM_BUG_ON(mc.precharge);
>                         VM_BUG_ON(mc.moved_charge);
>                         VM_BUG_ON(mc.moved_swap);
> -                       mem_cgroup_start_move(from);
> +                       mem_cgroup_begin_move();
>                         spin_lock(&mc.lock);
>                         mc.from = from;
>                         mc.to = memcg;
> @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
>                                 unsigned long addr, unsigned long end,
>                                 struct mm_walk *walk)
>  {
> -       int ret = 0;
> +       int ret;
>         struct vm_area_struct *vma = walk->private;
>         pte_t *pte;
>         spinlock_t *ptl;
> @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
>          *    to be unlocked in __split_huge_page_splitting(), where the main
>          *    part of thp split is not executed yet.
>          */
> +retry:
> +       ret = 0;
>         if (pmd_trans_huge_lock(pmd, vma) == 1) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
>                         spin_unlock(&vma->vm_mm->page_table_lock);
> @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
>                         page = target.page;
>                         if (!isolate_lru_page(page)) {
>                                 pc = lookup_page_cgroup(page);
> -                               if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> -                                                       pc, mc.from, mc.to)) {
> +                               ret = mem_cgroup_move_account(page,
> +                                           HPAGE_PMD_NR, pc, mc.from, mc.to);
> +                               if (!ret) {
>                                         mc.precharge -= HPAGE_PMD_NR;
>                                         mc.moved_charge += HPAGE_PMD_NR;
>                                 }
> @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
>                         put_page(page);
>                 }
>                 spin_unlock(&vma->vm_mm->page_table_lock);
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 return 0;
>         }
>
>         if (pmd_trans_unstable(pmd))
>                 return 0;
> -retry:
> +
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; addr += PAGE_SIZE) {
>                 pte_t ptent = *(pte++);
> @@ -6632,8 +6710,9 @@ retry:
>                         if (isolate_lru_page(page))
>                                 goto put;
>                         pc = lookup_page_cgroup(page);
> -                       if (!mem_cgroup_move_account(page, 1, pc,
> -                                                    mc.from, mc.to)) {
> +                       ret = mem_cgroup_move_account(page, 1, pc,
> +                                                     mc.from, mc.to);
> +                       if (!ret) {
>                                 mc.precharge--;
>                                 /* we uncharge from mc.from later. */
>                                 mc.moved_charge++;
> @@ -6653,11 +6732,15 @@ put:                    /* get_mctgt_type() gets the page
>                 default:
>                         break;
>                 }
> +               if (ret == -EAGAIN)
> +                       break;
>         }
>         pte_unmap_unlock(pte - 1, ptl);
>         cond_resched();
>
>         if (addr != end) {
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 /*
>                  * We have consumed all precharges we got in can_attach().
>                  * We try charge one by one, but don't do any additional
> --- 3.8-rc2/mm/rmap.c   2012-12-22 09:43:27.656015582 -0800
> +++ linux/mm/rmap.c     2013-01-02 15:03:46.100417650 -0800
> @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
>   */
>  void page_add_file_rmap(struct page *page)
>  {
> -       bool locked;
> -       unsigned long flags;
> +       bool clamped;
>
> -       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_begin_update_page_stat(page, &clamped);
>         if (atomic_inc_and_test(&page->_mapcount)) {
>                 __inc_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>         }
> -       mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /**
> @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         bool anon = PageAnon(page);
> -       bool locked;
> -       unsigned long flags;
> +       bool uninitialized_var(clamped);
>
>         /*
>          * The anon case has no mem_cgroup page_stat to update; but may
> -        * uncharge_page() below, where the lock ordering can deadlock if
> -        * we hold the lock against page_stat move: so avoid it on anon.
> +        * uncharge_page() below, when holding page_cgroup lock might force
> +        * a page_stat move to back off temporarily: so avoid it on anon.
>          */
>         if (!anon)
> -               mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_begin_update_page_stat(page, &clamped);
>
>         /* page still mapped by someone else? */
>         if (!atomic_add_negative(-1, &page->_mapcount))
> @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
>         } else {
>                 __dec_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>         }
>         if (unlikely(PageMlocked(page)))
>                 clear_page_mlock(page);
> @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
>         return;
>  out:
>         if (!anon)
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /*



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09 14:35           ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09 14:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>> >
>> > Maybe I have missed some other locking which would prevent this from
>> > happening but the locking relations are really complicated in this area
>> > so if mem_cgroup_{begin,end}_update_page_stat might be called
>> > recursively then we need a fat comment which justifies that.
>> >
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
It took me some time to understand the patch. Yes, it can solve my
recursion issue and also removes some locks (e.g. move_lock). But it
does have a side effect on the mover side, which becomes slower. To my
knowledge, each task is forked in the root memcg and is moved when it
is attached to a cgroup, so move_account is also a frequent operation
to some extent.
Some comments are below.

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>
>  include/linux/memcontrol.h |   39 +--
>  mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
>  mm/rmap.c                  |   20 -
>  3 files changed, 257 insertions(+), 177 deletions(-)
>
> --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
> +++ linux/include/linux/memcontrol.h    2013-01-02 14:47:47.960394878 -0800
> @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
>         return false;
>  }
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
> -                                        unsigned long *flags);
> -
> +void __mem_cgroup_begin_update_page_stat(struct page *page);
> +void __mem_cgroup_end_update_page_stat(void);
>  extern atomic_t memcg_moving;
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       rcu_read_lock();
> -       *locked = false;
> -       if (atomic_read(&memcg_moving))
> -               __mem_cgroup_begin_update_page_stat(page, locked, flags);
> +       preempt_disable();

Referring to the synchronize_rcu() in mem_cgroup_begin_move(): has the
rcu_read_lock() been lost here? (Sketch below, after the quoted
wrappers.)

> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page,
> -                               unsigned long *flags);
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       if (*locked)
> -               __mem_cgroup_end_update_page_stat(page, flags);
> -       rcu_read_unlock();
> +       /* We don't currently use the page arg, but keep it for symmetry */
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();

Ditto. Need rcu_read_unlock()?

> +       preempt_enable();
>  }
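
To make the question concrete, what I would have expected is something
like the sketch below: just these two inline wrappers with the
rcu_read_lock()/rcu_read_unlock() of the old code added back, so that
the mover's synchronize_rcu() is guaranteed to wait for updaters that
are already inside their critical section. This is only an
illustration of my question, not code from the patch:

static inline void mem_cgroup_begin_update_page_stat(struct page *page,
                                                     bool *clamped)
{
        /* Paired with synchronize_rcu() in mem_cgroup_begin_move() */
        rcu_read_lock();
        preempt_disable();
        *clamped = false;
        if (unlikely(atomic_read(&memcg_moving))) {
                __mem_cgroup_begin_update_page_stat(page);
                *clamped = true;
        }
}

static inline void mem_cgroup_end_update_page_stat(struct page *page,
                                                   bool *clamped)
{
        if (unlikely(*clamped))
                __mem_cgroup_end_update_page_stat();
        preempt_enable();
        rcu_read_unlock();
}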
>
>  void mem_cgroup_update_page_stat(struct page *page,
> @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
>  }
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> +       /* It may be helpful to our callers if the stub behaves the same way */
> +       preempt_disable();
>  }
>
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> +       preempt_enable();
>  }
>
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> --- 3.8-rc2/mm/memcontrol.c     2012-12-22 09:43:27.628015582 -0800
> +++ linux/mm/memcontrol.c       2013-01-02 14:55:36.268406008 -0800
> @@ -321,12 +321,7 @@ struct mem_cgroup {
>          * mem_cgroup ? And what type of charges should we move ?
>          */
>         unsigned long   move_charge_at_immigrate;
> -       /*
> -        * set > 0 if pages under this cgroup are moving to other cgroup.
> -        */
> -       atomic_t        moving_account;
> -       /* taken only while moving_account > 0 */
> -       spinlock_t      move_lock;
> +
>         /*
>          * percpu counter.
>          */
> @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
>  }
>
>  /*
> - * memcg->moving_account is used for checking possibility that some thread is
> - * calling move_account(). When a thread on CPU-A starts moving pages under
> - * a memcg, other threads should check memcg->moving_account under
> - * rcu_read_lock(), like this:
> - *
> - *         CPU-A                                    CPU-B
> - *                                              rcu_read_lock()
> - *         memcg->moving_account+1              if (memcg->mocing_account)
> - *                                                   take heavy locks.
> - *         synchronize_rcu()                    update something.
> - *                                              rcu_read_unlock()
> - *         start move here.
> - */
> -
> -/* for quick checking without looking up memcg */
> -atomic_t memcg_moving __read_mostly;
> -
> -static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> -{
> -       atomic_inc(&memcg_moving);
> -       atomic_inc(&memcg->moving_account);
> -       synchronize_rcu();
> -}
> -
> -static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> -{
> -       /*
> -        * Now, mem_cgroup_clear_mc() may call this function with NULL.
> -        * We check NULL in callee rather than caller.
> -        */
> -       if (memcg) {
> -               atomic_dec(&memcg_moving);
> -               atomic_dec(&memcg->moving_account);
> -       }
> -}
> -
> -/*
> - * 2 routines for checking "mem" is under move_account() or not.
> - *
> - * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
> - *                       is used for avoiding races in accounting.  If true,
> - *                       pc->mem_cgroup may be overwritten.
> - *
>   * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
>   *                       under hierarchy of moving cgroups. This is for
> - *                       waiting at hith-memory prressure caused by "move".
> + *                       waiting at high memory pressure caused by "move".
>   */
> -
> -static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
> -{
> -       VM_BUG_ON(!rcu_read_lock_held());
> -       return atomic_read(&memcg->moving_account) > 0;
> -}
> -
>  static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
>  {
>         struct mem_cgroup *from;
> @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
>         return false;
>  }
>
> -/*
> - * Take this lock when
> - * - a code tries to modify page's memcg while it's USED.
> - * - a code tries to modify page state accounting in a memcg.
> - * see mem_cgroup_stolen(), too.
> - */
> -static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
> -                                 unsigned long *flags)
> -{
> -       spin_lock_irqsave(&memcg->move_lock, *flags);
> -}
> -
> -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> -                               unsigned long *flags)
> -{
> -       spin_unlock_irqrestore(&memcg->move_lock, *flags);
> -}
> -
>  /**
>   * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>   * @memcg: The memory cgroup that went over limit
> @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> - *
> - * Notes: Race condition
> - *
> - * We usually use page_cgroup_lock() for accessing page_cgroup member but
> - * it tends to be costly. But considering some conditions, we doesn't need
> - * to do so _always_.
> - *
> - * Considering "charge", lock_page_cgroup() is not required because all
> - * file-stat operations happen after a page is attached to radix-tree. There
> - * are no race with "charge".
> - *
> - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> - * if there are race with "uncharge". Statistics itself is properly handled
> - * by flags.
> + * These functions mediate between the common case of updating memcg stats
> + * when a page transitions from one state to another, and the rare case of
> + * moving a page from one memcg to another.
> + *
> + * A simple example of the updater would be:
> + *     mem_cgroup_begin_update_page_stat(page);
> + *     if (TestClearPageFlag(page))
> + *             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *     mem_cgroup_end_update_page_stat(page);
> + *
> + * An over-simplified example of the mover would be:
> + *     mem_cgroup_begin_move();
> + *     for each page chosen from old_memcg {
> + *             pc = lookup_page_cgroup(page);
> + *             lock_page_cgroup(pc);
> + *             if (trylock_memcg_move(page)) {
> + *                     if (PageFlag(page)) {
> + *                             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *                             pc->mem_cgroup = new_memcg;
> + *                             mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
> + *                     }
> + *                     unlock_memcg_move();
> + *                     unlock_page_cgroup(pc);
> + *             }
> + *             cond_resched();
> + *     }
> + *     mem_cgroup_end_move();
> + *
> + * Without some kind of serialization between updater and mover, the mover
> + * cannot know whether or not to move one count from old to new memcg stats;
> + * but the serialization must be as lightweight as possible for the updater.
> + *
> + * At present we use two layers of lock avoidance, then spinlock on memcg;
> + * but that already got into (easily avoided) lock hierarchy violation with
> + * the page_cgroup lock; and as dirty writeback stats are added, it gets
> + * into further difficulty with the page cache radix tree lock (and on s390
> + * architecture, page_remove_rmap calls set_page_dirty within its critical
> + * section: perhaps that can be reordered, but if not, it requires nesting).
> + *
> + * We need a mechanism more like rcu_read_lock() for the updater, who then
> + * does not have to worry about lock ordering.  The scheme below is not quite
> + * as light as that: rarely, the updater does have to spin waiting on a mover;
> + * and it is still best for updater to avoid taking page_cgroup lock in its
> + * critical section (though mover drops and retries if necessary, so there is
> + * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
> + */
> +
> +/*
> + * memcg_moving count is written in advance by movers,
> + * and read by updaters to see if they need to worry further.
> + */
> +atomic_t memcg_moving __read_mostly;
> +
> +/*
> + * Keep it simple: allow only one page to move at a time.  cgroup_mutex
> + * already serializes move_charge_at_immigrate movements, but not writes
> + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
>   *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check mm->moving_account and detect there are possibility of race
> - * If there is, we take a lock.
> + * memcg_moving_lock guards writes by movers to memcg_moving_page,
> + * which is read by updaters to see if they need to worry about their page.
> + */
> +static DEFINE_SPINLOCK(memcg_moving_lock);
> +static struct page *memcg_moving_page;
> +
> +/*
> + * updating_page_stat is written per-cpu by updaters,
> + * and all cpus read by mover to check when safe to proceed with the move.
>   */
> +static DEFINE_PER_CPU(int, updating_page_stat) = 0;
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page,
> -                               bool *locked, unsigned long *flags)
> +/*
> + * Mover calls mem_cgroup_begin_move() before starting on its pages; its
> + * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
> + */
> +static void mem_cgroup_begin_move(void)
>  {
> -       struct mem_cgroup *memcg;
> -       struct page_cgroup *pc;
> +       get_online_cpus();
> +       atomic_inc(&memcg_moving);
> +       synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(void)
> +{
> +       atomic_dec(&memcg_moving);
> +       put_online_cpus();
> +}
> +
> +/*
> + * Mover calls trylock_memcg_move(page) before moving stats and changing
> + * ownership of page.  If it fails, mover should drop page_cgroup lock and
> + * any other spinlocks held, cond_resched then try the page again.  This
> + * lets updaters take those locks if unavoidable, though preferably not.
> + */
> +static bool trylock_memcg_move(struct page *page)
> +{
> +       static struct cpumask updating;
> +       int try;
> +
> +       cpumask_copy(&updating, cpu_online_mask);
> +       spin_lock(&memcg_moving_lock);
By the time we reach here, we have already taken lock_page_cgroup(),
which is a much heavier lock. But memcg_moving_lock is a single global
spinlock that every memcg doing move_account will try to get, so many
movers may end up spinning here while still holding their page_cgroup
lock. This may aggravate the performance decrease IMHO.

> +       memcg_moving_page = page;
>
> -       pc = lookup_page_cgroup(page);
> -again:
> -       memcg = pc->mem_cgroup;
> -       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> -               return;
>         /*
> -        * If this memory cgroup is not under account moving, we don't
> -        * need to take move_lock_mem_cgroup(). Because we already hold
> -        * rcu_read_lock(), any calls to move_account will be delayed until
> -        * rcu_read_unlock() if mem_cgroup_stolen() == true.
> +        * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
> +        * our memcg_moving_page before it commits to updating_page_stat.
>          */
> -       if (!mem_cgroup_stolen(memcg))
> -               return;
> +       smp_mb();
>
> -       move_lock_mem_cgroup(memcg, flags);
> -       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> -               move_unlock_mem_cgroup(memcg, flags);
> -               goto again;
> +       for (try = 0; try < 64; try++) {
> +               int updaters = 0;
> +               int cpu;
> +
> +               for_each_cpu(cpu, &updating) {
> +                       if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
> +                               updaters++;
> +                       else
> +                               cpumask_clear_cpu(cpu, &updating);
> +               }
> +               if (!updaters)
> +                       return true;
>         }
> -       *locked = true;
> +
> +       memcg_moving_page = NULL;
Is an smp_mb() also needed here?

> +       spin_unlock(&memcg_moving_lock);
> +       return false;
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> +static void unlock_memcg_move(void)
>  {
> -       struct page_cgroup *pc = lookup_page_cgroup(page);
> +       memcg_moving_page = NULL;
> +       spin_unlock(&memcg_moving_lock);
> +}
>
> -       /*
> -        * It's guaranteed that pc->mem_cgroup never changes while
> -        * lock is held because a routine modifies pc->mem_cgroup
> -        * should take move_lock_mem_cgroup().
> -        */
> -       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +/*
> + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
> + * (with preemption disabled) to indicate to the next mover that this cpu is
> + * updating a page, or to wait on the mover if it's already moving this page.
> + */
> +void __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> +       static const int probing = 0x10000;
> +       int updating;
> +
> +       __this_cpu_add(updating_page_stat, probing);
> +
> +       for (;;) {
> +               /*
> +                * Make sure that trylock_memcg_move(page) can see our
> +                * updating_page_stat before we check memcg_moving_page.
> +                *
> +                * We use the special probing value at first so move sees it,
> +                * but nesting and interrupts on this cpu can distinguish it.
> +                */
> +               smp_mb();
> +
> +               if (likely(page != ACCESS_ONCE(memcg_moving_page)))
> +                       break;
> +
> +               /*
> +                * We may be nested, we may be serving an interrupt: do not
> +                * hang here if the outer level already went beyond probing.
> +                */
> +               updating = __this_cpu_read(updating_page_stat);
> +               if (updating & (probing - 1))
> +                       break;
> +
> +               __this_cpu_write(updating_page_stat, 0);
> +               while (page == ACCESS_ONCE(memcg_moving_page))
> +                       cpu_relax();
> +               __this_cpu_write(updating_page_stat, updating);
> +       }
> +
> +       /* Add one to count and remove temporary probing value */
> +       __this_cpu_sub(updating_page_stat, probing - 1);
> +}
> +
> +void __mem_cgroup_end_update_page_stat(void)
> +{
> +       __this_cpu_dec(updating_page_stat);
> +}
> +
> +/*
> + * Static inline interfaces to the above in include/linux/memcontrol.h:
> + *
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                                    bool *clamped)
> +{
> +       preempt_disable();
> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> +static inline void mem_cgroup_end_update_page_stat(struct page *page,
> +                                                  bool *clamped)
> +{
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();
> +       preempt_enable();
> +}
> + */
> +
>  void mem_cgroup_update_page_stat(struct page *page,
>                                  enum mem_cgroup_page_stat_item idx, int val)
>  {
>         struct mem_cgroup *memcg;
>         struct page_cgroup *pc = lookup_page_cgroup(page);
> -       unsigned long uninitialized_var(flags);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
>                 BUG();
>         }
>
> -       this_cpu_add(memcg->stat->count[idx], val);
> +       /* mem_cgroup_begin_update_page_stat() disabled preemption */
> +       __this_cpu_add(memcg->stat->count[idx], val);
>  }
>
>  /*
> @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
>                                    struct mem_cgroup *from,
>                                    struct mem_cgroup *to)
>  {
> -       unsigned long flags;
>         int ret;
>         bool anon = PageAnon(page);
>
> @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
>         if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
>                 goto unlock;
>
> -       move_lock_mem_cgroup(from, &flags);
> +       ret = -EAGAIN;
> +       if (!trylock_memcg_move(page))
> +               goto unlock;
>
>         if (!anon && page_mapped(page)) {
>                 /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
>         }
>         mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
>         /* caller should have done css_get */
>         pc->mem_cgroup = to;
>         mem_cgroup_charge_statistics(to, anon, nr_pages);
> -       move_unlock_mem_cgroup(from, &flags);
> +       unlock_memcg_move();
>         ret = 0;
>  unlock:
>         unlock_page_cgroup(pc);
> @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
>          */
>         if (!parent)
>                 parent = root_mem_cgroup;
> -
> +retry:
>         if (nr_pages > 1) {
>                 VM_BUG_ON(!PageTransHuge(page));
>                 flags = compound_lock_irqsave(page);
>         }
>
> -       ret = mem_cgroup_move_account(page, nr_pages,
> -                               pc, child, parent);
> -       if (!ret)
> -               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +       ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
>
>         if (nr_pages > 1)
>                 compound_unlock_irqrestore(page, flags);
> +
> +       if (ret == -EAGAIN) {
> +               cond_resched();
> +               goto retry;
> +       }
> +
> +       if (!ret)
> +               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +
>         putback_lru_page(page);
>  put:
>         put_page(page);
> @@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
>                 /* This is for making all *used* pages to be on LRU. */
>                 lru_add_drain_all();
>                 drain_all_stock_sync(memcg);
> -               mem_cgroup_start_move(memcg);
> +               mem_cgroup_begin_move();
>                 for_each_node_state(node, N_MEMORY) {
>                         for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>                                 enum lru_list lru;
> @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
>                                 }
>                         }
>                 }
> -               mem_cgroup_end_move(memcg);
> +               mem_cgroup_end_move();
>                 memcg_oom_recover(memcg);
>                 cond_resched();
>
> @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
>         atomic_set(&memcg->refcnt, 1);
>         memcg->move_charge_at_immigrate = 0;
>         mutex_init(&memcg->thresholds_lock);
> -       spin_lock_init(&memcg->move_lock);
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>         if (error) {
> @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
>         mc.from = NULL;
>         mc.to = NULL;
>         spin_unlock(&mc.lock);
> -       mem_cgroup_end_move(from);
> +       if (from)
> +               mem_cgroup_end_move();
>  }
>
>  static int mem_cgroup_can_attach(struct cgroup *cgroup,
> @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
>                         VM_BUG_ON(mc.precharge);
>                         VM_BUG_ON(mc.moved_charge);
>                         VM_BUG_ON(mc.moved_swap);
> -                       mem_cgroup_start_move(from);
> +                       mem_cgroup_begin_move();
>                         spin_lock(&mc.lock);
>                         mc.from = from;
>                         mc.to = memcg;
> @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
>                                 unsigned long addr, unsigned long end,
>                                 struct mm_walk *walk)
>  {
> -       int ret = 0;
> +       int ret;
>         struct vm_area_struct *vma = walk->private;
>         pte_t *pte;
>         spinlock_t *ptl;
> @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
>          *    to be unlocked in __split_huge_page_splitting(), where the main
>          *    part of thp split is not executed yet.
>          */
> +retry:
> +       ret = 0;
>         if (pmd_trans_huge_lock(pmd, vma) == 1) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
>                         spin_unlock(&vma->vm_mm->page_table_lock);
> @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
>                         page = target.page;
>                         if (!isolate_lru_page(page)) {
>                                 pc = lookup_page_cgroup(page);
> -                               if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> -                                                       pc, mc.from, mc.to)) {
> +                               ret = mem_cgroup_move_account(page,
> +                                           HPAGE_PMD_NR, pc, mc.from, mc.to);
> +                               if (!ret) {
>                                         mc.precharge -= HPAGE_PMD_NR;
>                                         mc.moved_charge += HPAGE_PMD_NR;
>                                 }
> @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
>                         put_page(page);
>                 }
>                 spin_unlock(&vma->vm_mm->page_table_lock);
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 return 0;
>         }
>
>         if (pmd_trans_unstable(pmd))
>                 return 0;
> -retry:
> +
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; addr += PAGE_SIZE) {
>                 pte_t ptent = *(pte++);
> @@ -6632,8 +6710,9 @@ retry:
>                         if (isolate_lru_page(page))
>                                 goto put;
>                         pc = lookup_page_cgroup(page);
> -                       if (!mem_cgroup_move_account(page, 1, pc,
> -                                                    mc.from, mc.to)) {
> +                       ret = mem_cgroup_move_account(page, 1, pc,
> +                                                     mc.from, mc.to);
> +                       if (!ret) {
>                                 mc.precharge--;
>                                 /* we uncharge from mc.from later. */
>                                 mc.moved_charge++;
> @@ -6653,11 +6732,15 @@ put:                    /* get_mctgt_type() gets the page
>                 default:
>                         break;
>                 }
> +               if (ret == -EAGAIN)
> +                       break;
>         }
>         pte_unmap_unlock(pte - 1, ptl);
>         cond_resched();
>
>         if (addr != end) {
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 /*
>                  * We have consumed all precharges we got in can_attach().
>                  * We try charge one by one, but don't do any additional
> --- 3.8-rc2/mm/rmap.c   2012-12-22 09:43:27.656015582 -0800
> +++ linux/mm/rmap.c     2013-01-02 15:03:46.100417650 -0800
> @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
>   */
>  void page_add_file_rmap(struct page *page)
>  {
> -       bool locked;
> -       unsigned long flags;
> +       bool clamped;
>
> -       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_begin_update_page_stat(page, &clamped);
>         if (atomic_inc_and_test(&page->_mapcount)) {
>                 __inc_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>         }
> -       mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /**
> @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         bool anon = PageAnon(page);
> -       bool locked;
> -       unsigned long flags;
> +       bool uninitialized_var(clamped);
>
>         /*
>          * The anon case has no mem_cgroup page_stat to update; but may
> -        * uncharge_page() below, where the lock ordering can deadlock if
> -        * we hold the lock against page_stat move: so avoid it on anon.
> +        * uncharge_page() below, when holding page_cgroup lock might force
> +        * a page_stat move to back off temporarily: so avoid it on anon.
>          */
>         if (!anon)
> -               mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_begin_update_page_stat(page, &clamped);
>
>         /* page still mapped by someone else? */
>         if (!atomic_add_negative(-1, &page->_mapcount))
> @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
>         } else {
>                 __dec_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>         }
>         if (unlikely(PageMlocked(page)))
>                 clear_page_mlock(page);
> @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
>         return;
>  out:
>         if (!anon)
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /*



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09 14:35           ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09 14:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm,
	linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins <hughd@google.com> wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> > Maybe I have missed some other locking which would prevent this from
>> > happening but the locking relations are really complicated in this area
>> > so if mem_cgroup_{begin,end}_update_page_stat might be called
>> > recursively then we need a fat comment which justifies that.
>> >
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
It took me some time to understand the patch. Yes, it can solve my
recursion issue and also removes some locks (e.g. move_lock). But it
does have a side effect on the mover side, which becomes slower. To my
knowledge, each task is forked in the root memcg and is moved when it
is attached to a cgroup, so move_account is also a frequent operation
to some extent.
Some comments are below.

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>
>  include/linux/memcontrol.h |   39 +--
>  mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
>  mm/rmap.c                  |   20 -
>  3 files changed, 257 insertions(+), 177 deletions(-)
>
> --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
> +++ linux/include/linux/memcontrol.h    2013-01-02 14:47:47.960394878 -0800
> @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
>         return false;
>  }
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
> -                                        unsigned long *flags);
> -
> +void __mem_cgroup_begin_update_page_stat(struct page *page);
> +void __mem_cgroup_end_update_page_stat(void);
>  extern atomic_t memcg_moving;
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       rcu_read_lock();
> -       *locked = false;
> -       if (atomic_read(&memcg_moving))
> -               __mem_cgroup_begin_update_page_stat(page, locked, flags);
> +       preempt_disable();

Referring to the synchronize_rcu() in mem_cgroup_begin_move(): has the
rcu_read_lock() been lost here?

> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page,
> -                               unsigned long *flags);
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       if (*locked)
> -               __mem_cgroup_end_update_page_stat(page, flags);
> -       rcu_read_unlock();
> +       /* We don't currently use the page arg, but keep it for symmetry */
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();

Ditto. Need rcu_read_unlock()?

> +       preempt_enable();
>  }
>
>  void mem_cgroup_update_page_stat(struct page *page,
> @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
>  }
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> +       /* It may be helpful to our callers if the stub behaves the same way */
> +       preempt_disable();
>  }
>
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> +       preempt_enable();
>  }
>
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> --- 3.8-rc2/mm/memcontrol.c     2012-12-22 09:43:27.628015582 -0800
> +++ linux/mm/memcontrol.c       2013-01-02 14:55:36.268406008 -0800
> @@ -321,12 +321,7 @@ struct mem_cgroup {
>          * mem_cgroup ? And what type of charges should we move ?
>          */
>         unsigned long   move_charge_at_immigrate;
> -       /*
> -        * set > 0 if pages under this cgroup are moving to other cgroup.
> -        */
> -       atomic_t        moving_account;
> -       /* taken only while moving_account > 0 */
> -       spinlock_t      move_lock;
> +
>         /*
>          * percpu counter.
>          */
> @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
>  }
>
>  /*
> - * memcg->moving_account is used for checking possibility that some thread is
> - * calling move_account(). When a thread on CPU-A starts moving pages under
> - * a memcg, other threads should check memcg->moving_account under
> - * rcu_read_lock(), like this:
> - *
> - *         CPU-A                                    CPU-B
> - *                                              rcu_read_lock()
> - *         memcg->moving_account+1              if (memcg->mocing_account)
> - *                                                   take heavy locks.
> - *         synchronize_rcu()                    update something.
> - *                                              rcu_read_unlock()
> - *         start move here.
> - */
> -
> -/* for quick checking without looking up memcg */
> -atomic_t memcg_moving __read_mostly;
> -
> -static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> -{
> -       atomic_inc(&memcg_moving);
> -       atomic_inc(&memcg->moving_account);
> -       synchronize_rcu();
> -}
> -
> -static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> -{
> -       /*
> -        * Now, mem_cgroup_clear_mc() may call this function with NULL.
> -        * We check NULL in callee rather than caller.
> -        */
> -       if (memcg) {
> -               atomic_dec(&memcg_moving);
> -               atomic_dec(&memcg->moving_account);
> -       }
> -}
> -
> -/*
> - * 2 routines for checking "mem" is under move_account() or not.
> - *
> - * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
> - *                       is used for avoiding races in accounting.  If true,
> - *                       pc->mem_cgroup may be overwritten.
> - *
>   * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
>   *                       under hierarchy of moving cgroups. This is for
> - *                       waiting at hith-memory prressure caused by "move".
> + *                       waiting at high memory pressure caused by "move".
>   */
> -
> -static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
> -{
> -       VM_BUG_ON(!rcu_read_lock_held());
> -       return atomic_read(&memcg->moving_account) > 0;
> -}
> -
>  static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
>  {
>         struct mem_cgroup *from;
> @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
>         return false;
>  }
>
> -/*
> - * Take this lock when
> - * - a code tries to modify page's memcg while it's USED.
> - * - a code tries to modify page state accounting in a memcg.
> - * see mem_cgroup_stolen(), too.
> - */
> -static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
> -                                 unsigned long *flags)
> -{
> -       spin_lock_irqsave(&memcg->move_lock, *flags);
> -}
> -
> -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> -                               unsigned long *flags)
> -{
> -       spin_unlock_irqrestore(&memcg->move_lock, *flags);
> -}
> -
>  /**
>   * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>   * @memcg: The memory cgroup that went over limit
> @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> - *
> - * Notes: Race condition
> - *
> - * We usually use page_cgroup_lock() for accessing page_cgroup member but
> - * it tends to be costly. But considering some conditions, we doesn't need
> - * to do so _always_.
> - *
> - * Considering "charge", lock_page_cgroup() is not required because all
> - * file-stat operations happen after a page is attached to radix-tree. There
> - * are no race with "charge".
> - *
> - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> - * if there are race with "uncharge". Statistics itself is properly handled
> - * by flags.
> + * These functions mediate between the common case of updating memcg stats
> + * when a page transitions from one state to another, and the rare case of
> + * moving a page from one memcg to another.
> + *
> + * A simple example of the updater would be:
> + *     mem_cgroup_begin_update_page_stat(page);
> + *     if (TestClearPageFlag(page))
> + *             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *     mem_cgroup_end_update_page_stat(page);
> + *
> + * An over-simplified example of the mover would be:
> + *     mem_cgroup_begin_move();
> + *     for each page chosen from old_memcg {
> + *             pc = lookup_page_cgroup(page);
> + *             lock_page_cgroup(pc);
> + *             if (trylock_memcg_move(page)) {
> + *                     if (PageFlag(page)) {
> + *                             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *                             pc->mem_cgroup = new_memcg;
> + *                             mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
> + *                     }
> + *                     unlock_memcg_move();
> + *                     unlock_page_cgroup(pc);
> + *             }
> + *             cond_resched();
> + *     }
> + *     mem_cgroup_end_move();
> + *
> + * Without some kind of serialization between updater and mover, the mover
> + * cannot know whether or not to move one count from old to new memcg stats;
> + * but the serialization must be as lightweight as possible for the updater.
> + *
> + * At present we use two layers of lock avoidance, then spinlock on memcg;
> + * but that already got into (easily avoided) lock hierarchy violation with
> + * the page_cgroup lock; and as dirty writeback stats are added, it gets
> + * into further difficulty with the page cache radix tree lock (and on s390
> + * architecture, page_remove_rmap calls set_page_dirty within its critical
> + * section: perhaps that can be reordered, but if not, it requires nesting).
> + *
> + * We need a mechanism more like rcu_read_lock() for the updater, who then
> + * does not have to worry about lock ordering.  The scheme below is not quite
> + * as light as that: rarely, the updater does have to spin waiting on a mover;
> + * and it is still best for updater to avoid taking page_cgroup lock in its
> + * critical section (though mover drops and retries if necessary, so there is
> + * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
> + */
> +
> +/*
> + * memcg_moving count is written in advance by movers,
> + * and read by updaters to see if they need to worry further.
> + */
> +atomic_t memcg_moving __read_mostly;
> +
> +/*
> + * Keep it simple: allow only one page to move at a time.  cgroup_mutex
> + * already serializes move_charge_at_immigrate movements, but not writes
> + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
>   *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check mm->moving_account and detect there are possibility of race
> - * If there is, we take a lock.
> + * memcg_moving_lock guards writes by movers to memcg_moving_page,
> + * which is read by updaters to see if they need to worry about their page.
> + */
> +static DEFINE_SPINLOCK(memcg_moving_lock);
> +static struct page *memcg_moving_page;
> +
> +/*
> + * updating_page_stat is written per-cpu by updaters,
> + * and all cpus read by mover to check when safe to proceed with the move.
>   */
> +static DEFINE_PER_CPU(int, updating_page_stat) = 0;
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page,
> -                               bool *locked, unsigned long *flags)
> +/*
> + * Mover calls mem_cgroup_begin_move() before starting on its pages; its
> + * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
> + */
> +static void mem_cgroup_begin_move(void)
>  {
> -       struct mem_cgroup *memcg;
> -       struct page_cgroup *pc;
> +       get_online_cpus();
> +       atomic_inc(&memcg_moving);
> +       synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(void)
> +{
> +       atomic_dec(&memcg_moving);
> +       put_online_cpus();
> +}
> +
> +/*
> + * Mover calls trylock_memcg_move(page) before moving stats and changing
> + * ownership of page.  If it fails, mover should drop page_cgroup lock and
> + * any other spinlocks held, cond_resched then try the page again.  This
> + * lets updaters take those locks if unavoidable, though preferably not.
> + */
> +static bool trylock_memcg_move(struct page *page)
> +{
> +       static struct cpumask updating;
> +       int try;
> +
> +       cpumask_copy(&updating, cpu_online_mask);
> +       spin_lock(&memcg_moving_lock);
By the time we get here we already hold lock_page_cgroup(), which is a
much heavier lock. But memcg_moving_lock is a single global spinlock
that every memcg doing move_account will try to take, so many memcgs
may end up spinning here while still holding their page_cgroup lock.
IMHO this may aggravate the performance decrease.

> +       memcg_moving_page = page;
>
> -       pc = lookup_page_cgroup(page);
> -again:
> -       memcg = pc->mem_cgroup;
> -       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> -               return;
>         /*
> -        * If this memory cgroup is not under account moving, we don't
> -        * need to take move_lock_mem_cgroup(). Because we already hold
> -        * rcu_read_lock(), any calls to move_account will be delayed until
> -        * rcu_read_unlock() if mem_cgroup_stolen() == true.
> +        * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
> +        * our memcg_moving_page before it commits to updating_page_stat.
>          */
> -       if (!mem_cgroup_stolen(memcg))
> -               return;
> +       smp_mb();
>
> -       move_lock_mem_cgroup(memcg, flags);
> -       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> -               move_unlock_mem_cgroup(memcg, flags);
> -               goto again;
> +       for (try = 0; try < 64; try++) {
> +               int updaters = 0;
> +               int cpu;
> +
> +               for_each_cpu(cpu, &updating) {
> +                       if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
> +                               updaters++;
> +                       else
> +                               cpumask_clear_cpu(cpu, &updating);
> +               }
> +               if (!updaters)
> +                       return true;
>         }
> -       *locked = true;
> +
> +       memcg_moving_page = NULL;
Is an smp_mb() also needed here?

> +       spin_unlock(&memcg_moving_lock);
> +       return false;
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> +static void unlock_memcg_move(void)
>  {
> -       struct page_cgroup *pc = lookup_page_cgroup(page);
> +       memcg_moving_page = NULL;
> +       spin_unlock(&memcg_moving_lock);
> +}
>
> -       /*
> -        * It's guaranteed that pc->mem_cgroup never changes while
> -        * lock is held because a routine modifies pc->mem_cgroup
> -        * should take move_lock_mem_cgroup().
> -        */
> -       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +/*
> + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
> + * (with preemption disabled) to indicate to the next mover that this cpu is
> + * updating a page, or to wait on the mover if it's already moving this page.
> + */
> +void __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> +       static const int probing = 0x10000;
> +       int updating;
> +
> +       __this_cpu_add(updating_page_stat, probing);
> +
> +       for (;;) {
> +               /*
> +                * Make sure that trylock_memcg_move(page) can see our
> +                * updating_page_stat before we check memcg_moving_page.
> +                *
> +                * We use the special probing value at first so move sees it,
> +                * but nesting and interrupts on this cpu can distinguish it.
> +                */
> +               smp_mb();
> +
> +               if (likely(page != ACCESS_ONCE(memcg_moving_page)))
> +                       break;
> +
> +               /*
> +                * We may be nested, we may be serving an interrupt: do not
> +                * hang here if the outer level already went beyond probing.
> +                */
> +               updating = __this_cpu_read(updating_page_stat);
> +               if (updating & (probing - 1))
> +                       break;
> +
> +               __this_cpu_write(updating_page_stat, 0);
> +               while (page == ACCESS_ONCE(memcg_moving_page))
> +                       cpu_relax();
> +               __this_cpu_write(updating_page_stat, updating);
> +       }
> +
> +       /* Add one to count and remove temporary probing value */
> +       __this_cpu_sub(updating_page_stat, probing - 1);
> +}
> +
> +void __mem_cgroup_end_update_page_stat(void)
> +{
> +       __this_cpu_dec(updating_page_stat);
> +}
> +
> +/*
> + * Static inline interfaces to the above in include/linux/memcontrol.h:
> + *
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                                    bool *clamped)
> +{
> +       preempt_disable();
> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> +static inline void mem_cgroup_end_update_page_stat(struct page *page,
> +                                                  bool *clamped)
> +{
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();
> +       preempt_enable();
> +}
> + */
> +
>  void mem_cgroup_update_page_stat(struct page *page,
>                                  enum mem_cgroup_page_stat_item idx, int val)
>  {
>         struct mem_cgroup *memcg;
>         struct page_cgroup *pc = lookup_page_cgroup(page);
> -       unsigned long uninitialized_var(flags);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
>                 BUG();
>         }
>
> -       this_cpu_add(memcg->stat->count[idx], val);
> +       /* mem_cgroup_begin_update_page_stat() disabled preemption */
> +       __this_cpu_add(memcg->stat->count[idx], val);
>  }
>
>  /*
> @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
>                                    struct mem_cgroup *from,
>                                    struct mem_cgroup *to)
>  {
> -       unsigned long flags;
>         int ret;
>         bool anon = PageAnon(page);
>
> @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
>         if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
>                 goto unlock;
>
> -       move_lock_mem_cgroup(from, &flags);
> +       ret = -EAGAIN;
> +       if (!trylock_memcg_move(page))
> +               goto unlock;
>
>         if (!anon && page_mapped(page)) {
>                 /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
>         }
>         mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
>         /* caller should have done css_get */
>         pc->mem_cgroup = to;
>         mem_cgroup_charge_statistics(to, anon, nr_pages);
> -       move_unlock_mem_cgroup(from, &flags);
> +       unlock_memcg_move();
>         ret = 0;
>  unlock:
>         unlock_page_cgroup(pc);
> @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
>          */
>         if (!parent)
>                 parent = root_mem_cgroup;
> -
> +retry:
>         if (nr_pages > 1) {
>                 VM_BUG_ON(!PageTransHuge(page));
>                 flags = compound_lock_irqsave(page);
>         }
>
> -       ret = mem_cgroup_move_account(page, nr_pages,
> -                               pc, child, parent);
> -       if (!ret)
> -               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +       ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
>
>         if (nr_pages > 1)
>                 compound_unlock_irqrestore(page, flags);
> +
> +       if (ret == -EAGAIN) {
> +               cond_resched();
> +               goto retry;
> +       }
> +
> +       if (!ret)
> +               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +
>         putback_lru_page(page);
>  put:
>         put_page(page);
> @@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
>                 /* This is for making all *used* pages to be on LRU. */
>                 lru_add_drain_all();
>                 drain_all_stock_sync(memcg);
> -               mem_cgroup_start_move(memcg);
> +               mem_cgroup_begin_move();
>                 for_each_node_state(node, N_MEMORY) {
>                         for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>                                 enum lru_list lru;
> @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
>                                 }
>                         }
>                 }
> -               mem_cgroup_end_move(memcg);
> +               mem_cgroup_end_move();
>                 memcg_oom_recover(memcg);
>                 cond_resched();
>
> @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
>         atomic_set(&memcg->refcnt, 1);
>         memcg->move_charge_at_immigrate = 0;
>         mutex_init(&memcg->thresholds_lock);
> -       spin_lock_init(&memcg->move_lock);
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>         if (error) {
> @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
>         mc.from = NULL;
>         mc.to = NULL;
>         spin_unlock(&mc.lock);
> -       mem_cgroup_end_move(from);
> +       if (from)
> +               mem_cgroup_end_move();
>  }
>
>  static int mem_cgroup_can_attach(struct cgroup *cgroup,
> @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
>                         VM_BUG_ON(mc.precharge);
>                         VM_BUG_ON(mc.moved_charge);
>                         VM_BUG_ON(mc.moved_swap);
> -                       mem_cgroup_start_move(from);
> +                       mem_cgroup_begin_move();
>                         spin_lock(&mc.lock);
>                         mc.from = from;
>                         mc.to = memcg;
> @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
>                                 unsigned long addr, unsigned long end,
>                                 struct mm_walk *walk)
>  {
> -       int ret = 0;
> +       int ret;
>         struct vm_area_struct *vma = walk->private;
>         pte_t *pte;
>         spinlock_t *ptl;
> @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
>          *    to be unlocked in __split_huge_page_splitting(), where the main
>          *    part of thp split is not executed yet.
>          */
> +retry:
> +       ret = 0;
>         if (pmd_trans_huge_lock(pmd, vma) == 1) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
>                         spin_unlock(&vma->vm_mm->page_table_lock);
> @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
>                         page = target.page;
>                         if (!isolate_lru_page(page)) {
>                                 pc = lookup_page_cgroup(page);
> -                               if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> -                                                       pc, mc.from, mc.to)) {
> +                               ret = mem_cgroup_move_account(page,
> +                                           HPAGE_PMD_NR, pc, mc.from, mc.to);
> +                               if (!ret) {
>                                         mc.precharge -= HPAGE_PMD_NR;
>                                         mc.moved_charge += HPAGE_PMD_NR;
>                                 }
> @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
>                         put_page(page);
>                 }
>                 spin_unlock(&vma->vm_mm->page_table_lock);
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 return 0;
>         }
>
>         if (pmd_trans_unstable(pmd))
>                 return 0;
> -retry:
> +
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; addr += PAGE_SIZE) {
>                 pte_t ptent = *(pte++);
> @@ -6632,8 +6710,9 @@ retry:
>                         if (isolate_lru_page(page))
>                                 goto put;
>                         pc = lookup_page_cgroup(page);
> -                       if (!mem_cgroup_move_account(page, 1, pc,
> -                                                    mc.from, mc.to)) {
> +                       ret = mem_cgroup_move_account(page, 1, pc,
> +                                                     mc.from, mc.to);
> +                       if (!ret) {
>                                 mc.precharge--;
>                                 /* we uncharge from mc.from later. */
>                                 mc.moved_charge++;
> @@ -6653,11 +6732,15 @@ put:                    /* get_mctgt_type() gets the page
>                 default:
>                         break;
>                 }
> +               if (ret == -EAGAIN)
> +                       break;
>         }
>         pte_unmap_unlock(pte - 1, ptl);
>         cond_resched();
>
>         if (addr != end) {
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 /*
>                  * We have consumed all precharges we got in can_attach().
>                  * We try charge one by one, but don't do any additional
> --- 3.8-rc2/mm/rmap.c   2012-12-22 09:43:27.656015582 -0800
> +++ linux/mm/rmap.c     2013-01-02 15:03:46.100417650 -0800
> @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
>   */
>  void page_add_file_rmap(struct page *page)
>  {
> -       bool locked;
> -       unsigned long flags;
> +       bool clamped;
>
> -       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_begin_update_page_stat(page, &clamped);
>         if (atomic_inc_and_test(&page->_mapcount)) {
>                 __inc_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>         }
> -       mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /**
> @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         bool anon = PageAnon(page);
> -       bool locked;
> -       unsigned long flags;
> +       bool uninitialized_var(clamped);
>
>         /*
>          * The anon case has no mem_cgroup page_stat to update; but may
> -        * uncharge_page() below, where the lock ordering can deadlock if
> -        * we hold the lock against page_stat move: so avoid it on anon.
> +        * uncharge_page() below, when holding page_cgroup lock might force
> +        * a page_stat move to back off temporarily: so avoid it on anon.
>          */
>         if (!anon)
> -               mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_begin_update_page_stat(page, &clamped);
>
>         /* page still mapped by someone else? */
>         if (!atomic_add_negative(-1, &page->_mapcount))
> @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
>         } else {
>                 __dec_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>         }
>         if (unlikely(PageMlocked(page)))
>                 clear_page_mlock(page);
> @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
>         return;
>  out:
>         if (!anon)
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /*



-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09 14:47             ` Michal Hocko
  0 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-01-09 14:47 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Hugh Dickins, Johannes Weiner, linux-kernel, cgroups, linux-mm,
	linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Wed 09-01-13 22:35:12, Sha Zhengju wrote:
[...]
> To my knowledge, each task is forked in root memcg, and there's a
> moving while attaching it to a cgroup. So move_account is also a
> frequent behavior to some extent.

Not really. Every fork/exec copies the current group (see
cgroup_fork), so there is no moving on that path.
[...]
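
(To illustrate: the fork path only takes a reference on the parent's
css_set, conceptually something like the sketch below -- simplified
from memory of the 3.8-era code, so treat it as a sketch rather than
the exact source:

	/* child starts out in exactly the cgroups of its parent */
	void cgroup_fork(struct task_struct *child)
	{
		task_lock(current);
		child->cgroups = current->cgroups;	/* share parent's css_set */
		get_css_set(child->cgroups);		/* pin it */
		task_unlock(current);
	}

No page or charge is touched here, so mem_cgroup_move_account() only
runs when somebody explicitly moves the task into another cgroup.)
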
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-09 15:02           ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-09 15:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/05 13:48), Sha Zhengju wrote:
>>
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>
>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>
>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>
>>>> This patch adds memcg routines to count dirty pages, which allows memory
>>>> controller
>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>> provide some
>>>> info for users while cgroup's direct reclaim is working.
>>>
>>>
>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>> right? It is true that this is direct reclaim but it is not clear to me
>>
>>
>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>> by allocation and is distinct from background kswapd reclaim (global).
>>
>>> why the usefulnes should be limitted to the reclaim for users. I would
>>> understand this if the users was in fact in-kernel users.
>>>
>>
>> One of the reasons I'm trying to accounting the dirty pages is to get a
>> more board overall view of memory usages because memcg hard/soft
>> reclaim may have effect on response time of user application.
>> Yeah, the beneficiary can be application administrator or kernel users.
>> :P
>>
>>> [...]
>>>>
>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>> __set_page_dirty()
>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>> spinlock contention,
>>>> a prepare PageDirty() checking is added.
>>>
>>>
>>> But there is another AA deadlock here I believe.
>>> page_remove_rmap
>>>    mem_cgroup_begin_update_page_stat             <<< 1
>>>    set_page_dirty
>>>      __set_page_dirty_buffers
>>>        __set_page_dirty
>>>          mem_cgroup_begin_update_page_stat       <<< 2
>>>            move_lock_mem_cgroup
>>>              spin_lock_irqsave(&memcg->move_lock, *flags);
>>>
>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>> because we might race with the moving charges:
>>>          CPU0                                            CPU1
>>> page_remove_rmap
>>>                                                  mem_cgroup_can_attach
>>>    mem_cgroup_begin_update_page_stat (1)
>>>      rcu_read_lock
>>>                                                    mem_cgroup_start_move
>>>
>>> atomic_inc(&memcg_moving)
>>>
>>> atomic_inc(&memcg->moving_account)
>>>                                                      synchronize_rcu
>>>      __mem_cgroup_begin_update_page_stat
>>>        mem_cgroup_stolen <<< TRUE
>>>        move_lock_mem_cgroup
>>>    [...]
>>>          mem_cgroup_begin_update_page_stat (2)
>>>            __mem_cgroup_begin_update_page_stat
>>>              mem_cgroup_stolen     <<< still TRUE
>>>              move_lock_mem_cgroup  <<< DEADLOCK
>>>    [...]
>>>    mem_cgroup_end_update_page_stat
>>>      rcu_unlock
>>>                                                    # wake up from
>>> synchronize_rcu
>>>                                                  [...]
>>>                                                  mem_cgroup_move_task
>>>                                                    mem_cgroup_move_charge
>>>                                                      walk_page_range
>>>
>>> mem_cgroup_move_account
>>>
>>> move_lock_mem_cgroup
>>>
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>>
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I
>> thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>>
>> But with regard to the current lock implementation, I doubt if we can we
>> can
>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>> try to get move_lock once in the beginning. IMHO we can make
>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
>> thinking now is changing memcg->move_lock to rw-spinlock from the
>> original spinlock:
>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>> make it
>> reenterable and memcg moving task side try to get the write spinlock.
>> Then the race may be following:
>>
>>          CPU0                                            CPU1
>> page_remove_rmap
>>                                                  mem_cgroup_can_attach
>>    mem_cgroup_begin_update_page_stat (1)
>>      rcu_read_lock
>>                                                    mem_cgroup_start_move
>>
>> atomic_inc(&memcg_moving)
>>
>> atomic_inc(&memcg->moving_account)
>>                                                      synchronize_rcu
>>      __mem_cgroup_begin_update_page_stat
>>        mem_cgroup_stolen   <<< TRUE
>>        move_lock_mem_cgroup   <<<< read-spinlock success
>>    [...]
>>       mem_cgroup_begin_update_page_stat (2)
>>            __mem_cgroup_begin_update_page_stat
>>              mem_cgroup_stolen     <<< still TRUE
>>              move_lock_mem_cgroup  <<<< read-spinlock success
>>
>>    [...]
>>    mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>      rcu_unlock
>>                                                    # wake up from
>> synchronize_rcu
>>                                                  [...]
>>                                                  mem_cgroup_move_task
>>                                                    mem_cgroup_move_charge
>>                                                      walk_page_range
>>
>> mem_cgroup_move_account
>>
>> move_lock_mem_cgroup    <<< write-spinlock
>>
>>
>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>> Is there anything I lost?
>>
>
> rwlock will work with the nest but it seems ugly do updates under read-lock.
>
> How about this straightforward ?
> ==
> /*
>  * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
>  * the memcg again for nesting calls
>  */
> static void move_lock_mem_cgroup(memcg, flags);
> {
>         current->memcg_move_lock_nested += 1;
>         if (current->memcg_move_lock_nested > 1) {
>                 VM_BUG_ON(current->move_locked_memcg != memcg);
>                 return;
>         }
>         spin_lock_irqsave(&memcg_move_lock, &flags);
>         current->move_lockdev_memcg = memcg;
> }
>
> static void move_unlock_mem_cgroup(memcg, flags)
> {
>         current->memcg_move_lock_nested -= 1;
>         if (!current->memcg_move_lock_nested) {
>                 current->move_locked_memcg = NULL;
>                 spin_unlock_irqrestore(&memcg_move_lock,flags);
>         }
> }
>
Do we need to add two fields
(current->memcg_move_lock_nested / current->move_locked_memcg) to
'struct task_struct'? Is that feasible?
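
(If we did add them, it would presumably look something like the sketch
below in the task struct definition -- purely hypothetical, with the
field names borrowed from your pseudocode:

	#ifdef CONFIG_MEMCG
		/* nesting depth of move_lock_mem_cgroup() held by this task */
		int			memcg_move_lock_nested;
		/* memcg whose move lock this task currently holds, or NULL */
		struct mem_cgroup	*move_locked_memcg;
	#endif

That is two extra words in every task just for this nesting case.)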

Now I'm thinking about another synchronization proposal for the memcg
page stat updater and move_account, which seems to deal with both the
recursion issue and the deadlock:

             CPU A                                               CPU B

  move_lock_mem_cgroup
  old_memcg = pc->mem_cgroup
  TestSetPageDirty(page)
  move_unlock_mem_cgroup
                                                         move_lock_mem_cgroup
                                                         if (PageDirty)
                                                                 old_memcg->nr_dirty --
                                                                 new_memcg->nr_dirty ++
                                                         pc->mem_cgroup = new_memcg
                                                         move_unlock_mem_cgroup

  old_memcg->nr_dirty ++


So nr_dirty of old_memcg may go negative for a very short period (the
'old_memcg->nr_dirty --' done by CPU B), but it is corrected soon after
by CPU A, and the final value of memcg->nr_dirty is correct. Meanwhile
the move_lock only protects saving old_memcg and TestSetPageDirty in
its critical section, with no unrelated logic inside, so the lock
ordering and deadlock issues become much easier to handle.
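
For the dirty case, the updater side I have in mind would be roughly
the following (hand-wavy sketch only: account_page_dirtied_memcg() is a
made-up helper name, and I use MEM_CGROUP_STAT_FILE_DIRTY as a stand-in
for the dirty stat item of patch 4/8):

	/* hypothetical updater-side helper, ignoring the pc->mem_cgroup retry */
	static void account_page_dirtied_memcg(struct page *page)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);
		struct mem_cgroup *old_memcg;
		unsigned long flags;
		bool newly_dirty;

		/* lock only covers saving the owner and flipping the flag */
		move_lock_mem_cgroup(pc->mem_cgroup, &flags);
		old_memcg = pc->mem_cgroup;
		newly_dirty = !TestSetPageDirty(page);
		move_unlock_mem_cgroup(old_memcg, &flags);

		/* stat update runs outside the lock; a racing mover corrects it */
		if (newly_dirty)
			this_cpu_inc(old_memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
	}

so nothing else nests inside move_lock, and its ordering against
mapping->tree_lock or ->private_lock stops being a problem.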

But I'm not sure whether I've missed some race conditions; any
comments are welcome. :)


--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-10  2:16             ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  2:16 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

(2013/01/10 0:02), Sha Zhengju wrote:
> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>
>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>>
>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>
>>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>>
>>>>> This patch adds memcg routines to count dirty pages, which allows memory
>>>>> controller
>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>> provide some
>>>>> info for users while cgroup's direct reclaim is working.
>>>>
>>>>
>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>
>>>
>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>> by allocation and is distinct from background kswapd reclaim (global).
>>>
>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>> understand this if the users was in fact in-kernel users.
>>>>
>>>
>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>> more board overall view of memory usages because memcg hard/soft
>>> reclaim may have effect on response time of user application.
>>> Yeah, the beneficiary can be application administrator or kernel users.
>>> :P
>>>
>>>> [...]
>>>>>
>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>> __set_page_dirty()
>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>> spinlock contention,
>>>>> a prepare PageDirty() checking is added.
>>>>
>>>>
>>>> But there is another AA deadlock here I believe.
>>>> page_remove_rmap
>>>>     mem_cgroup_begin_update_page_stat             <<< 1
>>>>     set_page_dirty
>>>>       __set_page_dirty_buffers
>>>>         __set_page_dirty
>>>>           mem_cgroup_begin_update_page_stat       <<< 2
>>>>             move_lock_mem_cgroup
>>>>               spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>
>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>> because we might race with the moving charges:
>>>>           CPU0                                            CPU1
>>>> page_remove_rmap
>>>>                                                   mem_cgroup_can_attach
>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>       rcu_read_lock
>>>>                                                     mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>>                                                       synchronize_rcu
>>>>       __mem_cgroup_begin_update_page_stat
>>>>         mem_cgroup_stolen <<< TRUE
>>>>         move_lock_mem_cgroup
>>>>     [...]
>>>>           mem_cgroup_begin_update_page_stat (2)
>>>>             __mem_cgroup_begin_update_page_stat
>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>               move_lock_mem_cgroup  <<< DEADLOCK
>>>>     [...]
>>>>     mem_cgroup_end_update_page_stat
>>>>       rcu_unlock
>>>>                                                     # wake up from
>>>> synchronize_rcu
>>>>                                                   [...]
>>>>                                                   mem_cgroup_move_task
>>>>                                                     mem_cgroup_move_charge
>>>>                                                       walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup
>>>>
>>>>
>>>> Maybe I have missed some other locking which would prevent this from
>>>> happening but the locking relations are really complicated in this area
>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>> recursively then we need a fat comment which justifies that.
>>>>
>>>
>>> Ohhh...good catching!  I didn't notice there is a recursive call of
>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>> me a lot recently as the lock granularity is a little bigger than I
>>> thought.
>>> Not only the resource but also some code logic is in the range of locking
>>> which may be deadlock prone. The problem still exists if we are trying to
>>> add stat account of other memcg page later, may I make bold to suggest
>>> that we dig into the lock again...
>>>
>>> But with regard to the current lock implementation, I doubt if we can we
>>> can
>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>> try to get move_lock once in the beginning. IMHO we can make
>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>> original spinlock:
>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>> make it
>>> reenterable and memcg moving task side try to get the write spinlock.
>>> Then the race may be following:
>>>
>>>           CPU0                                            CPU1
>>> page_remove_rmap
>>>                                                   mem_cgroup_can_attach
>>>     mem_cgroup_begin_update_page_stat (1)
>>>       rcu_read_lock
>>>                                                     mem_cgroup_start_move
>>>
>>> atomic_inc(&memcg_moving)
>>>
>>> atomic_inc(&memcg->moving_account)
>>>                                                       synchronize_rcu
>>>       __mem_cgroup_begin_update_page_stat
>>>         mem_cgroup_stolen   <<< TRUE
>>>         move_lock_mem_cgroup   <<<< read-spinlock success
>>>     [...]
>>>        mem_cgroup_begin_update_page_stat (2)
>>>             __mem_cgroup_begin_update_page_stat
>>>               mem_cgroup_stolen     <<< still TRUE
>>>               move_lock_mem_cgroup  <<<< read-spinlock success
>>>
>>>     [...]
>>>     mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>>       rcu_unlock
>>>                                                     # wake up from
>>> synchronize_rcu
>>>                                                   [...]
>>>                                                   mem_cgroup_move_task
>>>                                                     mem_cgroup_move_charge
>>>                                                       walk_page_range
>>>
>>> mem_cgroup_move_account
>>>
>>> move_lock_mem_cgroup    <<< write-spinlock
>>>
>>>
>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>> Is there anything I lost?
>>>
>>
>> rwlock will work with the nest but it seems ugly do updates under read-lock.
>>
>> How about this straightforward ?
>> ==
>> /*
>>   * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
>>   * the memcg again for nesting calls
>>   */
>> static void move_lock_mem_cgroup(memcg, flags);
>> {
>>          current->memcg_move_lock_nested += 1;
>>          if (current->memcg_move_lock_nested > 1) {
>>                  VM_BUG_ON(current->move_locked_memcg != memcg);
>>                  return;
>>          }
>>          spin_lock_irqsave(&memcg_move_lock, &flags);
>>          current->move_lockdev_memcg = memcg;
>> }
>>
>> static void move_unlock_mem_cgroup(memcg, flags)
>> {
>>          current->memcg_move_lock_nested -= 1;
>>          if (!current->memcg_move_lock_nested) {
>>                  current->move_locked_memcg = NULL;
>>                  spin_unlock_irqrestore(&memcg_move_lock,flags);
>>          }
>> }
>>
> Does we need to add two
> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct
> task'? Is it feasible?
>
> Now I'm thinking about another synchronization proposal for memcg page
> stat updater and move_account, which seems to deal with recursion
> issue and deadlock:
>
>               CPU A                                               CPU B
>
>    move_lock_mem_cgroup
>    old_memcg = pc->mem_cgroup
>    TestSetPageDirty(page)
>    move_unlock_mem_cgroup
>                                                           move_lock_mem_cgroup
>                                                           if (PageDirty)
>
> old_memcg->nr_dirty --
>
> new_memcg->nr_dirty ++
>
> pc->mem_cgroup = new_memcgy
>                                                           move_unlock_mem_cgroup
>
>    old_memcg->nr_dirty ++
>

I'm sorry, I couldn't catch why you call TestSetPageDirty()... and what
CPU A/CPU B are doing. Does CPU A call move_account() and CPU B update
the stat? If so, why is move_account() allowed to set PG_dirty?


>
> So nr_dirty of old_memcg may be minus in a very short
> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon
> by CPU A. And the final figures of memcg->nr_dirty is correct.

It seems that both new_memcg and old_memcg have an account for the page. Is that correct?


> Meanwhile the move_lock only protect saving old_memcg and
> TestSetPageDirty in its critical section and without any irrelevant
> logic, so the lock order or deadlock can be handled easily.
>
> But I'm not sure whether I've lost some race conditions, any comments
> are welcomed. : )
>

Sorry, I couldn't understand.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-10  2:16             ` Kamezawa Hiroyuki
  (?)
@ 2013-01-10  4:26               ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-10  4:26 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/10 0:02), Sha Zhengju wrote:
>>
>> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>>
>>>>
>>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>>>
>>>>>
>>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>>
>>>>>>
>>>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>>>
>>>>>> This patch adds memcg routines to count dirty pages, which allows
>>>>>> memory
>>>>>> controller
>>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>>> provide some
>>>>>> info for users while cgroup's direct reclaim is working.
>>>>>
>>>>>
>>>>>
>>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>>
>>>>
>>>>
>>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>>> by allocation and is distinct from background kswapd reclaim (global).
>>>>
>>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>>> understand this if the users was in fact in-kernel users.
>>>>>
>>>>
>>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>>> more board overall view of memory usages because memcg hard/soft
>>>> reclaim may have effect on response time of user application.
>>>> Yeah, the beneficiary can be application administrator or kernel users.
>>>> :P
>>>>
>>>>> [...]
>>>>>>
>>>>>>
>>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>>> __set_page_dirty()
>>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>>> spinlock contention,
>>>>>> a prepare PageDirty() checking is added.
>>>>>
>>>>>
>>>>>
>>>>> But there is another AA deadlock here I believe.
>>>>> page_remove_rmap
>>>>>     mem_cgroup_begin_update_page_stat             <<< 1
>>>>>     set_page_dirty
>>>>>       __set_page_dirty_buffers
>>>>>         __set_page_dirty
>>>>>           mem_cgroup_begin_update_page_stat       <<< 2
>>>>>             move_lock_mem_cgroup
>>>>>               spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>>
>>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>>> because we might race with the moving charges:
>>>>>           CPU0                                            CPU1
>>>>> page_remove_rmap
>>>>>                                                   mem_cgroup_can_attach
>>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>>       rcu_read_lock
>>>>>
>>>>> mem_cgroup_start_move
>>>>>
>>>>> atomic_inc(&memcg_moving)
>>>>>
>>>>> atomic_inc(&memcg->moving_account)
>>>>>                                                       synchronize_rcu
>>>>>       __mem_cgroup_begin_update_page_stat
>>>>>         mem_cgroup_stolen <<< TRUE
>>>>>         move_lock_mem_cgroup
>>>>>     [...]
>>>>>           mem_cgroup_begin_update_page_stat (2)
>>>>>             __mem_cgroup_begin_update_page_stat
>>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>>               move_lock_mem_cgroup  <<< DEADLOCK
>>>>>     [...]
>>>>>     mem_cgroup_end_update_page_stat
>>>>>       rcu_unlock
>>>>>                                                     # wake up from
>>>>> synchronize_rcu
>>>>>                                                   [...]
>>>>>                                                   mem_cgroup_move_task
>>>>>
>>>>> mem_cgroup_move_charge
>>>>>                                                       walk_page_range
>>>>>
>>>>> mem_cgroup_move_account
>>>>>
>>>>> move_lock_mem_cgroup
>>>>>
>>>>>
>>>>> Maybe I have missed some other locking which would prevent this from
>>>>> happening but the locking relations are really complicated in this area
>>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>>> recursively then we need a fat comment which justifies that.
>>>>>
>>>>
>>>> Ohhh...good catching!  I didn't notice there is a recursive call of
>>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>>> me a lot recently as the lock granularity is a little bigger than I
>>>> thought.
>>>> Not only the resource but also some code logic is in the range of
>>>> locking
>>>> which may be deadlock prone. The problem still exists if we are trying
>>>> to
>>>> add stat account of other memcg page later, may I make bold to suggest
>>>> that we dig into the lock again...
>>>>
>>>> But with regard to the current lock implementation, I doubt if we can we
>>>> can
>>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>>> try to get move_lock once in the beginning. IMHO we can make
>>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what
>>>> I'm
>>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>>> original spinlock:
>>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>>> make it
>>>> reenterable and memcg moving task side try to get the write spinlock.
>>>> Then the race may be following:
>>>>
>>>>           CPU0                                            CPU1
>>>> page_remove_rmap
>>>>                                                   mem_cgroup_can_attach
>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>       rcu_read_lock
>>>>
>>>> mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>>                                                       synchronize_rcu
>>>>       __mem_cgroup_begin_update_page_stat
>>>>         mem_cgroup_stolen   <<< TRUE
>>>>         move_lock_mem_cgroup   <<<< read-spinlock success
>>>>     [...]
>>>>        mem_cgroup_begin_update_page_stat (2)
>>>>             __mem_cgroup_begin_update_page_stat
>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>               move_lock_mem_cgroup  <<<< read-spinlock success
>>>>
>>>>     [...]
>>>>     mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>>>       rcu_unlock
>>>>                                                     # wake up from
>>>> synchronize_rcu
>>>>                                                   [...]
>>>>                                                   mem_cgroup_move_task
>>>>
>>>> mem_cgroup_move_charge
>>>>                                                       walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup    <<< write-spinlock
>>>>
>>>>
>>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>>> Is there anything I lost?
>>>>
>>>
>>> rwlock will work with the nest but it seems ugly do updates under
>>> read-lock.
>>>
>>> How about this straightforward ?
>>> ==
>>> /*
>>>   * Once a thread takes memcg_move_lock() on a memcg, it can take the
>>> lock on
>>>   * the memcg again for nesting calls
>>>   */
>>> static void move_lock_mem_cgroup(memcg, flags);
>>> {
>>>          current->memcg_move_lock_nested += 1;
>>>          if (current->memcg_move_lock_nested > 1) {
>>>                  VM_BUG_ON(current->move_locked_memcg != memcg);
>>>                  return;
>>>          }
>>>          spin_lock_irqsave(&memcg_move_lock, &flags);
>>>          current->move_lockdev_memcg = memcg;
>>> }
>>>
>>> static void move_unlock_mem_cgroup(memcg, flags)
>>> {
>>>          current->memcg_move_lock_nested -= 1;
>>>          if (!current->memcg_move_lock_nested) {
>>>                  current->move_locked_memcg = NULL;
>>>                  spin_unlock_irqrestore(&memcg_move_lock,flags);
>>>          }
>>> }
>>>
>> Does we need to add two
>> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct
>> task'? Is it feasible?
>>
>> Now I'm thinking about another synchronization proposal for memcg page
>> stat updater and move_account, which seems to deal with recursion
>> issue and deadlock:
>>
>>               CPU A                                               CPU B
>>
>>    move_lock_mem_cgroup
>>    old_memcg = pc->mem_cgroup
>>    TestSetPageDirty(page)
>>    move_unlock_mem_cgroup
>>
>> move_lock_mem_cgroup
>>                                                           if (PageDirty)
>>
>> old_memcg->nr_dirty --
>>
>> new_memcg->nr_dirty ++
>>
>> pc->mem_cgroup = new_memcgy
>>
>> move_unlock_mem_cgroup
>>
>>    old_memcg->nr_dirty ++
>>
>
> I'm sorry I couldn't catch why you call TestSetPageDirty()....and what
> CPUA/CPUB is
> doing ? CPUA calls move_account() and CPUB updates stat ? If so, why
> move_account()
> is allowed to set PG_dirty ??
>

Sorry, the layout above came out as a mess and is confusing...
From the beginning: after removing duplicated information like the PCG_*
flags in 'struct page_cgroup' (commit 2ff76f1193), there is a problem
between "move" and "page stat accounting".
Assume CPU-A does "page stat accounting" and CPU-B does "move":

CPU-A                        CPU-B
TestSet PG_dirty
(delay)                 move_lock_mem_cgroup()
                            if (PageDirty(page)) {
                                  old_memcg->nr_dirty --
                                  new_memcg->nr_dirty++
                            }
                            pc->mem_cgroup = new_memcg;
                            move_unlock_mem_cgroup()

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
memcg->nr_dirty++
move_unlock_mem_cgroup()

Here the accounting information of new_memcg may be double-counted. So we
use a bigger lock to solve this problem (commit 89c06bd52f):
         move_lock_mem_cgroup()
         TestSetPageDirty(page)
         update page stats (without any checks)
         move_unlock_mem_cgroup()
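
In code, that pattern is roughly the following (a simplified sketch, not
the exact patch code; the function name is made up and the FILE_DIRTY
index is the one this series introduces):

/* Simplified sketch of the commit 89c06bd52f pattern, applied to the
 * dirty counter this series adds. */
static void account_page_dirtied_memcg(struct page *page)
{
        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (!TestSetPageDirty(page))
                mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
        /* TestSetPageDirty() and the stat update sit inside the same
         * begin/end section, so a concurrent move_account() cannot slip
         * in between them and double-count. */
        mem_cgroup_end_update_page_stat(page, &locked, &flags);
}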

But this method also has its pros and cons (e.g. it needs lock nesting).
So I wonder whether the following is able to deal with these issues all
together (CPU-A does "page stat accounting" and CPU-B does "move"):

             CPU-A                            CPU-B

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
SetPageDirty(page)
move_unlock_mem_cgroup()
                                      move_lock_mem_cgroup()
                                      if (PageDirty) {
                                               old_memcg->nr_dirty --;
                                               new_memcg->nr_dirty ++;
                                       }
                                       pc->mem_cgroup = new_memcg
                                       move_unlock_mem_cgroup()

memcg->nr_dirty ++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
setting PG_dirty inside the move_lock, and then update the stat if the
page was set PG_dirty successfully. But CPU-B may do the "moving" first,
so its "old_memcg->nr_dirty --" makes old_memcg->nr_dirty incorrect for
a moment; CPU-A's "memcg->nr_dirty ++" follows on its heels and amends
the stat.
However, there is a potential problem that old_memcg->nr_dirty may go
negative for a very short period, but IMHO that is not a big issue.
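
For completeness, a corresponding sketch of the move_account() side
(again illustrative only; memcg_nr_dirty_inc/dec() are made-up names for
the per-memcg dirty counter updates, not actual patch code):

static void move_page_dirty_account(struct page *page, struct page_cgroup *pc,
                                    struct mem_cgroup *old_memcg,
                                    struct mem_cgroup *new_memcg)
{
        unsigned long flags;

        move_lock_mem_cgroup(old_memcg, &flags);
        if (PageDirty(page)) {
                /* may run before the updater's increment, so old_memcg's
                 * counter can dip below zero for a moment */
                memcg_nr_dirty_dec(old_memcg);
                memcg_nr_dirty_inc(new_memcg);
        }
        pc->mem_cgroup = new_memcg;
        move_unlock_mem_cgroup(old_memcg, &flags);
}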

I hope that is clear.  : )
Thanks!

>
>>
>> So nr_dirty of old_memcg may be minus in a very short
>> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon
>> by CPU A. And the final figures of memcg->nr_dirty is correct.
>
>
> It seems both of new_memcg and old_memcg has an account for a page. Is it
> correct ?
>
>
>
>> Meanwhile the move_lock only protect saving old_memcg and
>> TestSetPageDirty in its critical section and without any irrelevant
>> logic, so the lock order or deadlock can be handled easily.
>>
>> But I'm not sure whether I've lost some race conditions, any comments
>> are welcomed. : )
>>
>
> Sorry I couldn't understand.
>
> Thanks,
> -Kame
>
>



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-10  4:26               ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-10  4:26 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/10 0:02), Sha Zhengju wrote:
>>
>> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>>
>>>>
>>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>>>
>>>>>
>>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>>
>>>>>>
>>>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>>>
>>>>>> This patch adds memcg routines to count dirty pages, which allows
>>>>>> memory
>>>>>> controller
>>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>>> provide some
>>>>>> info for users while cgroup's direct reclaim is working.
>>>>>
>>>>>
>>>>>
>>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>>
>>>>
>>>>
>>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>>> by allocation and is distinct from background kswapd reclaim (global).
>>>>
>>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>>> understand this if the users was in fact in-kernel users.
>>>>>
>>>>
>>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>>> more board overall view of memory usages because memcg hard/soft
>>>> reclaim may have effect on response time of user application.
>>>> Yeah, the beneficiary can be application administrator or kernel users.
>>>> :P
>>>>
>>>>> [...]
>>>>>>
>>>>>>
>>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>>> __set_page_dirty()
>>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>>> spinlock contention,
>>>>>> a prepare PageDirty() checking is added.
>>>>>
>>>>>
>>>>>
>>>>> But there is another AA deadlock here I believe.
>>>>> page_remove_rmap
>>>>>     mem_cgroup_begin_update_page_stat             <<< 1
>>>>>     set_page_dirty
>>>>>       __set_page_dirty_buffers
>>>>>         __set_page_dirty
>>>>>           mem_cgroup_begin_update_page_stat       <<< 2
>>>>>             move_lock_mem_cgroup
>>>>>               spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>>
>>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>>> because we might race with the moving charges:
>>>>>           CPU0                                            CPU1
>>>>> page_remove_rmap
>>>>>                                                   mem_cgroup_can_attach
>>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>>       rcu_read_lock
>>>>>
>>>>> mem_cgroup_start_move
>>>>>
>>>>> atomic_inc(&memcg_moving)
>>>>>
>>>>> atomic_inc(&memcg->moving_account)
>>>>>                                                       synchronize_rcu
>>>>>       __mem_cgroup_begin_update_page_stat
>>>>>         mem_cgroup_stolen <<< TRUE
>>>>>         move_lock_mem_cgroup
>>>>>     [...]
>>>>>           mem_cgroup_begin_update_page_stat (2)
>>>>>             __mem_cgroup_begin_update_page_stat
>>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>>               move_lock_mem_cgroup  <<< DEADLOCK
>>>>>     [...]
>>>>>     mem_cgroup_end_update_page_stat
>>>>>       rcu_unlock
>>>>>                                                     # wake up from
>>>>> synchronize_rcu
>>>>>                                                   [...]
>>>>>                                                   mem_cgroup_move_task
>>>>>
>>>>> mem_cgroup_move_charge
>>>>>                                                       walk_page_range
>>>>>
>>>>> mem_cgroup_move_account
>>>>>
>>>>> move_lock_mem_cgroup
>>>>>
>>>>>
>>>>> Maybe I have missed some other locking which would prevent this from
>>>>> happening but the locking relations are really complicated in this area
>>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>>> recursively then we need a fat comment which justifies that.
>>>>>
>>>>
>>>> Ohhh...good catching!  I didn't notice there is a recursive call of
>>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>>> me a lot recently as the lock granularity is a little bigger than I
>>>> thought.
>>>> Not only the resource but also some code logic is in the range of
>>>> locking
>>>> which may be deadlock prone. The problem still exists if we are trying
>>>> to
>>>> add stat account of other memcg page later, may I make bold to suggest
>>>> that we dig into the lock again...
>>>>
>>>> But with regard to the current lock implementation, I doubt if we can we
>>>> can
>>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>>> try to get move_lock once in the beginning. IMHO we can make
>>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what
>>>> I'm
>>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>>> original spinlock:
>>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>>> make it
>>>> reenterable and memcg moving task side try to get the write spinlock.
>>>> Then the race may be following:
>>>>
>>>>           CPU0                                            CPU1
>>>> page_remove_rmap
>>>>                                                   mem_cgroup_can_attach
>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>       rcu_read_lock
>>>>
>>>> mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>>                                                       synchronize_rcu
>>>>       __mem_cgroup_begin_update_page_stat
>>>>         mem_cgroup_stolen   <<< TRUE
>>>>         move_lock_mem_cgroup   <<<< read-spinlock success
>>>>     [...]
>>>>        mem_cgroup_begin_update_page_stat (2)
>>>>             __mem_cgroup_begin_update_page_stat
>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>               move_lock_mem_cgroup  <<<< read-spinlock success
>>>>
>>>>     [...]
>>>>     mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>>>       rcu_unlock
>>>>                                                     # wake up from
>>>> synchronize_rcu
>>>>                                                   [...]
>>>>                                                   mem_cgroup_move_task
>>>>
>>>> mem_cgroup_move_charge
>>>>                                                       walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup    <<< write-spinlock
>>>>
>>>>
>>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>>> Is there anything I lost?
>>>>
>>>
>>> rwlock will work with the nest but it seems ugly do updates under
>>> read-lock.
>>>
>>> How about this straightforward ?
>>> ==
>>> /*
>>>   * Once a thread takes memcg_move_lock() on a memcg, it can take the
>>> lock on
>>>   * the memcg again for nesting calls
>>>   */
>>> static void move_lock_mem_cgroup(memcg, flags);
>>> {
>>>          current->memcg_move_lock_nested += 1;
>>>          if (current->memcg_move_lock_nested > 1) {
>>>                  VM_BUG_ON(current->move_locked_memcg != memcg);
>>>                  return;
>>>          }
>>>          spin_lock_irqsave(&memcg_move_lock, flags);
>>>          current->move_locked_memcg = memcg;
>>> }
>>>
>>> static void move_unlock_mem_cgroup(memcg, flags)
>>> {
>>>          current->memcg_move_lock_nested -= 1;
>>>          if (!current->memcg_move_lock_nested) {
>>>                  current->move_locked_memcg = NULL;
>>>                  spin_unlock_irqrestore(&memcg_move_lock,flags);
>>>          }
>>> }
>>>
>> Does we need to add two
>> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct
>> task'? Is it feasible?
>>
>> Now I'm thinking about another synchronization proposal for memcg page
>> stat updater and move_account, which seems to deal with recursion
>> issue and deadlock:
>>
>>               CPU A                                               CPU B
>>
>>    move_lock_mem_cgroup
>>    old_memcg = pc->mem_cgroup
>>    TestSetPageDirty(page)
>>    move_unlock_mem_cgroup
>>
>> move_lock_mem_cgroup
>>                                                           if (PageDirty)
>>
>> old_memcg->nr_dirty --
>>
>> new_memcg->nr_dirty ++
>>
>> pc->mem_cgroup = new_memcgy
>>
>> move_unlock_mem_cgroup
>>
>>    old_memcg->nr_dirty ++
>>
>
> I'm sorry I couldn't catch why you call TestSetPageDirty()....and what
> CPUA/CPUB is
> doing ? CPUA calls move_account() and CPUB updates stat ? If so, why
> move_account()
> is allowed to set PG_dirty ??
>

Sorry, the layout above is a mess and confusing...
Starting from the beginning: after removing duplicated information such
as the PCG_* flags in 'struct page_cgroup' (commit 2ff76f1193), there is
a race between "move" and "page stat accounting". Assume CPU-A does
"page stat accounting" and CPU-B does "move":

CPU-A                        CPU-B
TestSet PG_dirty
(delay)                 move_lock_mem_cgroup()
                            if (PageDirty(page)) {
                                  old_memcg->nr_dirty --
                                  new_memcg->nr_dirty++
                            }
                            pc->mem_cgroup = new_memcg;
                            move_unlock_mem_cgroup()

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
memcg->nr_dirty++
move_unlock_mem_cgroup()

Here the accounting information of new_memcg ends up double-counted (and
old_memcg gets a spurious decrement). So a bigger lock is used to solve
this problem (commit 89c06bd52f):
         move_lock_mem_cgroup()
         TestSetPageDirty(page)
         update page stats (without any checks)
         move_unlock_mem_cgroup()
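
At a dirty-accounting site, that pattern comes down to something like the
following (only a rough sketch in the shorthand of this thread: the bare
nr_dirty++ stands in for the real per-cpu stat update, and the way the real
code stabilizes pc->mem_cgroup before taking the lock is glossed over):

/*
 * Rough sketch of the commit 89c06bd52f scheme at a dirty-accounting
 * site: the test-and-set and the counter update both run under the
 * move lock, so move_account() cannot slip in between the two.
 */
void account_dirty_bigger_lock(struct page *page, struct page_cgroup *pc)
{
	unsigned long flags;

	move_lock_mem_cgroup(pc->mem_cgroup, &flags);
	if (!TestSetPageDirty(page))
		pc->mem_cgroup->nr_dirty++;	/* no PageDirty() re-check needed */
	move_unlock_mem_cgroup(pc->mem_cgroup, &flags);
}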

But this method also has its downsides (e.g. it requires lock nesting).
So I wonder whether the following can deal with these issues all
together (CPU-A does "page stat accounting" and CPU-B does "move"):

             CPU-A                            CPU-B

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
SetPageDirty(page)
move_unlock_mem_cgroup()
                                      move_lock_mem_cgroup()
                                      if (PageDirty) {
                                               old_memcg->nr_dirty --;
                                               new_memcg->nr_dirty ++;
                                       }
                                       pc->mem_cgroup = new_memcg
                                       move_unlock_mem_cgroup()

memcg->nr_dirty ++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
SetPageDirty inside the move lock, and then update the stats if the page
was successfully set PG_dirty. CPU-B may do the "moving" first; in that
case "old_memcg->nr_dirty --" temporarily makes old_memcg->nr_dirty
incorrect, but CPU-A's "memcg->nr_dirty ++" follows right on its heels
and amends the stats.
However, there is a potential problem that old_memcg->nr_dirty may go
negative for a very short period, but IMHO that is not a big issue.

I hope that is clear.  : )
Thanks!

>
>>
>> So nr_dirty of old_memcg may be minus in a very short
>> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon
>> by CPU A. And the final figures of memcg->nr_dirty is correct.
>
>
> It seems both of new_memcg and old_memcg has an account for a page. Is it
> correct ?
>
>
>
>> Meanwhile the move_lock only protect saving old_memcg and
>> TestSetPageDirty in its critical section and without any irrelevant
>> logic, so the lock order or deadlock can be handled easily.
>>
>> But I'm not sure whether I've lost some race conditions, any comments
>> are welcomed. : )
>>
>
> Sorry I couldn't understand.
>
> Thanks,
> -Kame
>
>



-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-10  5:03                 ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 124+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  5:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

(2013/01/10 13:26), Sha Zhengju wrote:

> But this method also has its downsides (e.g. it requires lock nesting).
> So I wonder whether the following can deal with these issues all
> together (CPU-A does "page stat accounting" and CPU-B does "move"):
>
>               CPU-A                            CPU-B
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> SetPageDirty(page)
> move_unlock_mem_cgroup()
>                                        move_lock_mem_cgroup()
>                                        if (PageDirty) {
>                                                 old_memcg->nr_dirty --;
>                                                 new_memcg->nr_dirty ++;
>                                         }
>                                         pc->mem_cgroup = new_memcg
>                                         move_unlock_mem_cgroup()
>
> memcg->nr_dirty ++
>
>
> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
> SetPageDirty inside the move lock, and then update the stats if the page
> was successfully set PG_dirty. CPU-B may do the "moving" first; in that
> case "old_memcg->nr_dirty --" temporarily makes old_memcg->nr_dirty
> incorrect, but CPU-A's "memcg->nr_dirty ++" follows right on its heels
> and amends the stats.
> However, there is a potential problem that old_memcg->nr_dirty may go
> negative for a very short period, but IMHO that is not a big issue.
>

IMHO, this will work. Please take care that the recorded memcg is not an
invalid pointer by the time you update nr_dirty later.
(Maybe RCU will protect it.)
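
I mean an ordering like the following (just a rough, untested sketch of
where rcu_read_lock() would sit; the bare nr_dirty++ stands in for the
real stat update):

static void update_stat_with_rcu(struct page *page, struct page_cgroup *pc)
{
	struct mem_cgroup *memcg;
	unsigned long flags;

	rcu_read_lock();			/* keeps the recorded memcg alive */
	move_lock_mem_cgroup(pc->mem_cgroup, &flags);
	memcg = pc->mem_cgroup;			/* record it under the lock */
	SetPageDirty(page);
	move_unlock_mem_cgroup(memcg, &flags);
	memcg->nr_dirty++;			/* memcg is still a valid pointer here */
	rcu_read_unlock();
}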

_If_ this method can handle the "nesting" problem more cleanly and make the
implementation simpler, please go ahead. To be honest, I'm not sure what the
code will look like until I see the patch. Hmm, why do you write SetPageDirty()
here rather than TestSetPageDirty()....

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
@ 2013-01-10  8:28                   ` Sha Zhengju
  0 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-01-10  8:28 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Thu, Jan 10, 2013 at 1:03 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/10 13:26), Sha Zhengju wrote:
>
>> But this method also has its downsides (e.g. it requires lock nesting).
>> So I wonder whether the following can deal with these issues all
>> together (CPU-A does "page stat accounting" and CPU-B does "move"):
>>
>>               CPU-A                            CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> SetPageDirty(page)
>> move_unlock_mem_cgroup()
>>                                        move_lock_mem_cgroup()
>>                                        if (PageDirty) {
>>                                                 old_memcg->nr_dirty --;
>>                                                 new_memcg->nr_dirty ++;
>>                                         }
>>                                         pc->mem_cgroup = new_memcg
>>                                         move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty ++
>>
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> SetPageDirty inside the move lock, and then update the stats if the page
>> was successfully set PG_dirty. CPU-B may do the "moving" first; in that
>> case "old_memcg->nr_dirty --" temporarily makes old_memcg->nr_dirty
>> incorrect, but CPU-A's "memcg->nr_dirty ++" follows right on its heels
>> and amends the stats.
>> However, there is a potential problem that old_memcg->nr_dirty may go
>> negative for a very short period, but IMHO that is not a big issue.
>>
>
> IMHO, this will work. Please take care of that the recorded memcg will not
> be invalid pointer when you update the nr_dirty later.
> (Maybe RCU will protect it.)
>
Yes, there are 3 places that change pc->mem_cgroup: charge, uncharge and
move_account. "charge" has no race with the stat updater, "uncharge"
doesn't reset pc->mem_cgroup directly, and "move_account" is exactly the
case we are handling, so they should do no harm here. Meanwhile, an
invalid pointer caused by cgroup deletion should also be avoided by RCU.
Still, that is a rough conclusion from a quick look...

> _If_ this method can handle the "nesting" problem more cleanly and make
> the implementation simpler, please go ahead. To be honest, I'm not sure
> what the code will look like until I see the patch.

Okay, I'll try to propose a patch later.

> Hmm, why do you write SetPageDirty() here rather than
> TestSetPageDirty()....

No particular reason... TestSetPageDirty() would be more precise... : )
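
Putting it together with TestSetPageDirty(), the two sides would look
roughly like the sketch below (shorthand from this thread only: the bare
++/-- stand in for the real per-cpu stat updates, and stabilizing
pc->mem_cgroup before taking the move lock is glossed over):

/* Updater side (CPU-A): record the memcg and set the flag under the
 * move lock, then update the recorded memcg's counter after dropping it. */
void cpu_a_account_dirty(struct page *page, struct page_cgroup *pc)
{
	struct mem_cgroup *memcg;
	unsigned long flags;
	bool dirtied;

	move_lock_mem_cgroup(pc->mem_cgroup, &flags);
	memcg = pc->mem_cgroup;
	dirtied = !TestSetPageDirty(page);
	move_unlock_mem_cgroup(memcg, &flags);

	if (dirtied)
		memcg->nr_dirty++;	/* may briefly race with move, as discussed */
}

/* Mover side (CPU-B): transfer the count of a flagged page under the lock. */
void cpu_b_move_account(struct page *page, struct page_cgroup *pc,
			struct mem_cgroup *new_memcg)
{
	struct mem_cgroup *old_memcg = pc->mem_cgroup;
	unsigned long flags;

	move_lock_mem_cgroup(old_memcg, &flags);
	if (PageDirty(page)) {
		old_memcg->nr_dirty--;	/* may go negative for a moment */
		new_memcg->nr_dirty++;
	}
	pc->mem_cgroup = new_memcg;
	move_unlock_mem_cgroup(old_memcg, &flags);
}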


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-02 10:44     ` Michal Hocko
@ 2013-05-03  9:11       ` Michal Hocko
  -1 siblings, 0 replies; 124+ messages in thread
From: Michal Hocko @ 2013-05-03  9:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed 02-01-13 11:44:21, Michal Hocko wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > This patch adds memcg routines to count dirty pages, which allows memory controller
> > to maintain an accurate view of the amount of its dirty memory and can provide some
> > info for users while cgroup's direct reclaim is working.
> 
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me
> why the usefulnes should be limitted to the reclaim for users. I would
> understand this if the users was in fact in-kernel users.
> 
> [...]
> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
> > a prepare PageDirty() checking is added.
> 
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat		<<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat	<<< 2
> 	  move_lock_mem_cgroup
> 	    spin_lock_irqsave(&memcg->move_lock, *flags);

JFYI, since abf09bed (s390/mm: implement software dirty bits) this is no
longer possible. I haven't checked whether there are other cases like
this one, and it would be better if mem_cgroup_begin_update_page_stat
were recursion safe, if that can be done without too many hacks.
I will have a look at this (hopefully) sometime next week.
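
One possible shape for that, along the lines of Kamezawa's earlier
per-task nesting idea (purely a sketch; the task_struct fields and the
helper names here are made up):

/* Hypothetical nesting-aware wrappers around the per-memcg move lock. */
static void memcg_stat_lock_nested(struct mem_cgroup *memcg, unsigned long *flags)
{
	if (current->memcg_stat_nesting++) {		/* hypothetical field */
		VM_BUG_ON(current->memcg_stat_memcg != memcg);
		return;					/* already held, just nest */
	}
	spin_lock_irqsave(&memcg->move_lock, *flags);
	current->memcg_stat_memcg = memcg;
}

static void memcg_stat_unlock_nested(struct mem_cgroup *memcg, unsigned long *flags)
{
	if (--current->memcg_stat_nesting)
		return;					/* still nested */
	current->memcg_stat_memcg = NULL;
	spin_unlock_irqrestore(&memcg->move_lock, *flags);
}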

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-05-03  9:11       ` Michal Hocko
@ 2013-05-03  9:59         ` Sha Zhengju
  -1 siblings, 0 replies; 124+ messages in thread
From: Sha Zhengju @ 2013-05-03  9:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Cgroups, linux-mm, linux-fsdevel, Andrew Morton,
	KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Glauber Costa,
	Dave Chinner, Sha Zhengju

On Fri, May 3, 2013 at 5:11 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 02-01-13 11:44:21, Michal Hocko wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> > From: Sha Zhengju <handai.szj@taobao.com>
>> >
>> > This patch adds memcg routines to count dirty pages, which allows memory controller
>> > to maintain an accurate view of the amount of its dirty memory and can provide some
>> > info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>> why the usefulnes should be limitted to the reclaim for users. I would
>> understand this if the users was in fact in-kernel users.
>>
>> [...]
>> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>> > a prepare PageDirty() checking is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat           <<< 1
>>   set_page_dirty
>>     __set_page_dirty_buffers
>>       __set_page_dirty
>>         mem_cgroup_begin_update_page_stat     <<< 2
>>         move_lock_mem_cgroup
>>           spin_lock_irqsave(&memcg->move_lock, *flags);
>
> JFYI, since abf09bed (s390/mm: implement software dirty bits) this is no
> longer possible. I haven't checked whether there are other cases like
> this one, and it would be better if mem_cgroup_begin_update_page_stat
> were recursion safe, if that can be done without too many hacks.
> I will have a look at this (hopefully) sometime next week.
>

Hi Michal,


I'm sorry for not being able to return to this problem immediately after LSF/MM.
That is good news. IIRC, it's the only place where we have encountered the
recursion problem in accounting memcg dirty pages. But I'll try to revive my
previous work of simplifying the mem_cgroup_begin_update_page_stat() lock.
I'll get back to it in the next few days.


--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 124+ messages in thread

end of thread, other threads:[~2013-05-03  9:59 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-25 17:18 [PATCH V3 0/8] Per-cgroup page stat accounting Sha Zhengju
2012-12-25 17:18 ` Sha Zhengju
2012-12-25 17:20 ` [PATCH V3 1/8] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2012-12-25 17:20   ` Sha Zhengju
2012-12-25 17:22 ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-12-25 17:22   ` Sha Zhengju
2012-12-25 17:22   ` Sha Zhengju
2012-12-28  0:39   ` Kamezawa Hiroyuki
2012-12-28  0:39     ` Kamezawa Hiroyuki
2013-01-05  2:34     ` Sha Zhengju
2013-01-05  2:34       ` Sha Zhengju
2013-01-02  9:08   ` Michal Hocko
2013-01-02  9:08     ` Michal Hocko
2013-01-05  2:49     ` Sha Zhengju
2013-01-05  2:49       ` Sha Zhengju
2013-01-05  2:49       ` Sha Zhengju
2013-01-05 10:45       ` Michal Hocko
2013-01-05 10:45         ` Michal Hocko
2013-01-05 10:45         ` Michal Hocko
2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
2012-12-25 17:24   ` Sha Zhengju
2012-12-28  0:41   ` Kamezawa Hiroyuki
2012-12-28  0:41     ` Kamezawa Hiroyuki
2012-12-28  0:41     ` Kamezawa Hiroyuki
2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
2012-12-25 17:26   ` Sha Zhengju
2013-01-02 10:44   ` Michal Hocko
2013-01-02 10:44     ` Michal Hocko
2013-01-05  4:48     ` Sha Zhengju
2013-01-05  4:48       ` Sha Zhengju
2013-01-06 20:02       ` Hugh Dickins
2013-01-06 20:02         ` Hugh Dickins
2013-01-07  7:49         ` Kamezawa Hiroyuki
2013-01-07  7:49           ` Kamezawa Hiroyuki
2013-01-07  7:49           ` Kamezawa Hiroyuki
2013-01-09  5:15           ` Hugh Dickins
2013-01-09  5:15             ` Hugh Dickins
2013-01-09  7:24             ` Kamezawa Hiroyuki
2013-01-09  7:24               ` Kamezawa Hiroyuki
2013-01-09  7:24               ` Kamezawa Hiroyuki
2013-01-09 14:35         ` Sha Zhengju
2013-01-09 14:35           ` Sha Zhengju
2013-01-09 14:35           ` Sha Zhengju
2013-01-09 14:47           ` Michal Hocko
2013-01-09 14:47             ` Michal Hocko
2013-01-09 14:47             ` Michal Hocko
2013-01-07  7:25       ` Kamezawa Hiroyuki
2013-01-07  7:25         ` Kamezawa Hiroyuki
2013-01-07  7:25         ` Kamezawa Hiroyuki
2013-01-09 15:02         ` Sha Zhengju
2013-01-09 15:02           ` Sha Zhengju
2013-01-09 15:02           ` Sha Zhengju
2013-01-10  2:16           ` Kamezawa Hiroyuki
2013-01-10  2:16             ` Kamezawa Hiroyuki
2013-01-10  2:16             ` Kamezawa Hiroyuki
2013-01-10  4:26             ` Sha Zhengju
2013-01-10  4:26               ` Sha Zhengju
2013-01-10  4:26               ` Sha Zhengju
2013-01-10  5:03               ` Kamezawa Hiroyuki
2013-01-10  5:03                 ` Kamezawa Hiroyuki
2013-01-10  5:03                 ` Kamezawa Hiroyuki
2013-01-10  8:28                 ` Sha Zhengju
2013-01-10  8:28                   ` Sha Zhengju
2013-01-10  8:28                   ` Sha Zhengju
2013-05-03  9:11     ` Michal Hocko
2013-05-03  9:11       ` Michal Hocko
2013-05-03  9:59       ` Sha Zhengju
2013-05-03  9:59         ` Sha Zhengju
2013-01-06 20:07   ` Greg Thelen
2013-01-06 20:07     ` Greg Thelen
2013-01-06 20:07     ` Greg Thelen
2013-01-09  9:45     ` Sha Zhengju
2013-01-09  9:45       ` Sha Zhengju
2013-01-09  9:45       ` Sha Zhengju
2012-12-25 17:26 ` [PATCH V3 5/8] memcg: add per cgroup writeback " Sha Zhengju
2012-12-25 17:26   ` Sha Zhengju
2012-12-28  0:52   ` Kamezawa Hiroyuki
2012-12-28  0:52     ` Kamezawa Hiroyuki
2013-01-02 11:15   ` Michal Hocko
2013-01-02 11:15     ` Michal Hocko
2013-01-02 11:15     ` Michal Hocko
2013-01-06 20:07   ` Greg Thelen
2013-01-06 20:07     ` Greg Thelen
2013-01-09  9:08     ` Sha Zhengju
2013-01-09  9:08       ` Sha Zhengju
2012-12-25 17:27 ` [PATCH V3 6/8] memcg: Don't account root_mem_cgroup page statistics Sha Zhengju
2012-12-25 17:27   ` Sha Zhengju
2012-12-28  1:04   ` Kamezawa Hiroyuki
2012-12-28  1:04     ` Kamezawa Hiroyuki
2013-01-05  7:38     ` Sha Zhengju
2013-01-05  7:38       ` Sha Zhengju
2013-01-05  7:38       ` Sha Zhengju
2013-01-02 12:27   ` Michal Hocko
2013-01-02 12:27     ` Michal Hocko
2013-01-02 12:27     ` Michal Hocko
2013-01-05 10:52     ` Sha Zhengju
2013-01-05 10:52       ` Sha Zhengju
2013-01-05 10:52       ` Sha Zhengju
2013-01-09 12:57       ` Michal Hocko
2013-01-09 12:57         ` Michal Hocko
2012-12-25 17:27 ` [PATCH V3 7/8] memcg: disable memcg page stat accounting code when not in use Sha Zhengju
2012-12-25 17:27   ` Sha Zhengju
2012-12-25 17:27   ` Sha Zhengju
2012-12-28  1:06   ` Kamezawa Hiroyuki
2012-12-28  1:06     ` Kamezawa Hiroyuki
2012-12-28  1:06     ` Kamezawa Hiroyuki
2012-12-28  1:45   ` Kamezawa Hiroyuki
2012-12-28  1:45     ` Kamezawa Hiroyuki
2013-01-05 11:06     ` Sha Zhengju
2013-01-05 11:06       ` Sha Zhengju
2013-01-02 13:35   ` Michal Hocko
2013-01-02 13:35     ` Michal Hocko
2012-12-25 17:28 ` [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2012-12-25 17:28   ` Sha Zhengju
2012-12-25 17:28   ` Sha Zhengju
2012-12-28  1:10   ` Kamezawa Hiroyuki
2012-12-28  1:10     ` Kamezawa Hiroyuki
2012-12-28  1:10     ` Kamezawa Hiroyuki
2013-01-06  2:55     ` Sha Zhengju
2013-01-06  2:55       ` Sha Zhengju
2013-01-06  2:55       ` Sha Zhengju
2013-01-02 13:36   ` Michal Hocko
2013-01-02 13:36     ` Michal Hocko
2013-01-02 13:36     ` Michal Hocko
