* [PATCH V2 0/6] Per-cgroup page stat accounting
From: Sha Zhengju @ 2012-07-27 10:20 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Hi, list

This V2 patch series provides the ability for each memory cgroup to maintain
independent dirty/writeback page statistics, which can supply information for
per-cgroup direct reclaim or other purposes.
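
With the series applied, the two new counters show up in each group's
memory.stat file (values are in bytes), wherever the memory cgroup hierarchy
is mounted. For example (illustrative output only; the path and numbers are
made up):

	# grep -E 'dirty|writeback' /sys/fs/cgroup/memory/<group>/memory.stat
	dirty 1216512
	writeback 524288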

In the first three preparatory patches, we do some cleanup and rework the vfs
set_page_dirty routines so that "modify page info" and "dirty page accounting"
stay in one function as much as possible, for the sake of the bigger memcg lock
(test numbers are in the specific patch).

Kame, I tested these patches on mainline v3.5, because I cannot boot the kernel
under linux-next :(. But these patches are cooked on top of your recent memcg
patches (backported to mainline), and I believe there is no conflict with the
mm tree. So if there is no other problem, I think they could be considered for
merging.



The following is a performance comparison before/after the series:

Test steps (Mem=24G, ext4):
drop_cache; sync
cat /proc/meminfo|grep Dirty (=4KB)
fio (buffered/randwrite/bs=4k/size=128m/filesize=1g/numjobs=8/sync)
cat /proc/meminfo|grep Dirty (=648696kB)
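
(For reference, the fio options above correspond roughly to a command line
like the following; this is a reconstruction, the exact job file was not
posted, and the test directory is a placeholder:)

	fio --name=dirtytest --directory=/mnt/test --ioengine=sync \
	    --rw=randwrite --bs=4k --size=128m --filesize=1g --numjobs=8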

We ran the test 10 times and took the average numbers:
Before:
write: io=1024.0MB, bw=334678 KB/s, iops=83669.2 , runt=  3136 msec
lat (usec): min=1 , max=26203.1 , avg=81.473, stdev=275.754

After:
write: io=1024.0MB, bw=325219 KB/s, iops= 81304.1 , runt=  3226.9 msec
lat (usec): min=1 , max=17224 , avg=86.194, stdev=298.183



There is about a 2.8% performance decrease. But note that once memcg is
enabled, root_memcg exists and all allocated pages belong to it, so they all go
through the root memcg statistics routines, which brings some overhead.
Moreover, in the case of memcg enabled && no child cgroups, we could derive the
root memcg stats from the global numbers, which would avoid both the accounting
overhead and many if-tests. I'll work further on it later.
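
Roughly, the idea is something like the sketch below; mem_cgroup_has_children()
is a hypothetical helper used only for illustration, not an existing interface:

	void mem_cgroup_update_page_stat(struct page *page,
					 enum mem_cgroup_stat_index idx, int val)
	{
		/*
		 * Hypothetical fast path: with no child cgroups, root memcg
		 * stats equal the global vmstat numbers, so the per-page
		 * accounting could be skipped here and the root stats
		 * derived from the global counters at read time instead.
		 */
		if (!mem_cgroup_has_children(root_mem_cgroup))
			return;

		/* ... existing per-memcg accounting ... */
	}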

Any comments are welcome. :)



Change log:
v2 <-- v1:
	1. add test numbers
	2. small fixes and comment updates

Sha Zhengju (6):
	memcg-remove-MEMCG_NR_FILE_MAPPED.patch
	Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
	Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
	memcg-add-per-cgroup-dirty-pages-accounting.patch
	memcg-add-per-cgroup-writeback-pages-accounting.patch
	memcg-Document-cgroup-dirty-writeback-memory-statist.patch

 Documentation/cgroups/memory.txt |    2 +
 fs/buffer.c                      |   36 +++++++++++++++--------
 fs/ceph/addr.c                   |   20 +------------
 include/linux/buffer_head.h      |    2 +
 include/linux/memcontrol.h       |   30 ++++++++++++++-----
 mm/filemap.c                     |    9 ++++++
 mm/memcontrol.c                  |   58 +++++++++++++++++++-------------------
 mm/page-writeback.c              |   48 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    6 ++++
 10 files changed, 141 insertions(+), 74 deletions(-)




* [PATCH V2 1/6] memcg: remove MEMCG_NR_FILE_MAPPED
From: Sha Zhengju @ 2012-07-27 10:23 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

While accounting memcg page stats, it is not worthwhile to use
MEMCG_NR_FILE_MAPPED as an extra layer of indirection, given the complexity and
presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |   28 ++++++++++++++++++++--------
 mm/memcontrol.c            |   25 +++----------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 25 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83e7ba9..c1e2617 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,21 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * Statistics for memory cgroup.
+ *
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
+ * These two lists should keep in accord with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -164,17 +176,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -349,12 +361,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1940ba8..aef9fb0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,21 +76,10 @@ static int really_do_swap_account __initdata = 0;
 #define do_swap_account		0
 #endif
 
-
 /*
- * Statistics for memory cgroup.
+ * The corresponding mem_cgroup_stat_index is defined in include/linux/memcontrol.h,
+ * These two lists should keep in accord with each other.
  */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -1926,7 +1915,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1939,14 +1928,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..cd7e54e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1148,7 +1148,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1202,7 +1202,7 @@ void page_remove_rmap(struct page *page)
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
-- 
1.7.1




* [PATCH V2 2/6] Make TestSetPageDirty and dirty page accounting in one func
From: Sha Zhengju @ 2012-07-27 10:25 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, torvalds, viro, linux-fsdevel, npiggin, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
extracted TestSetPageDirty from __set_page_dirty, moving it far away from
account_page_dirtied. But it is better to keep the two operations in one single
function for modularity. So, in order to avoid the potential race mentioned in
commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
From a quick look, there appears to be no deadlock between ->private_lock and
->tree_lock. This is a preparatory patch for the following memcg dirty page
accounting patches.
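
After this change the nesting in __set_page_dirty_buffers() becomes (a minimal
sketch of the resulting lock order, derived from the diff below):

	spin_lock(&mapping->private_lock);
	/* attach buffers, then set the page dirty under private_lock */
	newly_dirty = __set_page_dirty(page, mapping, 1);
			/* takes spin_lock_irq(&mapping->tree_lock) inside */
	spin_unlock(&mapping->private_lock);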


Here are some test numbers before/after this patch:

Test steps (Mem=24G, ext4):
drop_cache; sync
fio (buffered/randwrite/bs=4k/size=128m/filesize=1g/numjobs=8/sync)

We ran the test 10 times and took the average numbers:
Before:
write: io=1024.0MB, bw=334678 KB/s, iops=83669.2 , runt=  3136 msec
lat (usec): min=1 , max=26203.1 , avg=81.473, stdev=275.754
After:
write: io=1024.0MB, bw=331583 KB/s, iops=82895.3 , runt=  3164.4 msec
lat (usec): min=1.1 , max=19001.6 , avg=83.544, stdev=272.704

Note that the impact is small (~1%).

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 fs/buffer.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c7062c8..5e0b0d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	if (TestSetPageDirty(page))
+		return 0;
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	return 1;
 }
 
 /*
@@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	newly_dirty = !TestSetPageDirty(page);
+	newly_dirty = __set_page_dirty(page, mapping, 1);
 	spin_unlock(&mapping->private_lock);
 
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1119,14 +1125,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh)) {
-		struct page *page = bh->b_page;
-		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
-			if (mapping)
-				__set_page_dirty(page, mapping, 0);
-		}
-	}
+	if (!test_set_buffer_dirty(bh))
+		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
+
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
-- 
1.7.1




* [PATCH V2 3/6] Use vfs __set_page_dirty interface instead of doing it inside filesystem
From: Sha Zhengju @ 2012-07-27 10:27 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, torvalds, viro, linux-fsdevel, sage, ceph-devel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

From here on we will treat SetPageDirty and dirty page accounting as one
integrated operation. Filesystems should use the vfs interface directly
instead of open-coding those details.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Sage Weil <sage@inktank.com>
---
 fs/buffer.c                 |    3 ++-
 fs/ceph/addr.c              |   20 ++------------------
 include/linux/buffer_head.h |    2 ++
 3 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 5e0b0d2..ffcfb87 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
 	if (unlikely(!mapping))
@@ -631,6 +631,7 @@ static int __set_page_dirty(struct page *page,
 
 	return 1;
 }
+EXPORT_SYMBOL(__set_page_dirty);
 
 /*
  * Add a page to the dirty page list.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b67304..d028fbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>	/* generic_writepages */
+#include <linux/buffer_head.h>
 #include <linux/slab.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
 	int undo = 0;
 	struct ceph_snap_context *snapc;
 
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page)) {
-		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
-		     mapping->host, page, page->index);
+	if (!__set_page_dirty(page, mapping, 1))
 		return 0;
-	}
 
 	inode = mapping->host;
 	ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
 }
 
 extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+		struct address_space *mapping, int warn);
 
 #else /* CONFIG_BLOCK */
 
-- 
1.7.1




* [PATCH V2 4/6] memcg: add per cgroup dirty pages accounting
From: Sha Zhengju @ 2012-07-27 10:28 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory, and
which can provide some information for users while a group's direct reclaim is
working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"),
we can use 'struct page' flags to test page state instead of per-page_cgroup
flags. But memcg has a feature to move a page from one cgroup to another, so
there may be a race between "move" and "page stat accounting". In order to
avoid the race, we have designed a bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires that (a) and (b) (the dirty page accounting) stay close enough together.

In the previous two preparatory patches, we reworked the vfs set_page_dirty
routines, and now the interfaces are more explicit (see the caller sketch
after the list below):
        incrementing (2):
                __set_page_dirty
                __set_page_dirty_nobuffers
        decrementing (2):
                clear_page_dirty_for_io
                cancel_dirty_page
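
In caller form, the protocol looks roughly like this (a minimal sketch
mirroring the __set_page_dirty change in the diff below):

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page)) {		/* (a) modify page info */
		/* (b) account the change while still inside the bracket */
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
		/* ... zone/bdi accounting, radix tree tagging ... */
	}
	mem_cgroup_end_update_page_stat(page, &locked, &flags);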


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/buffer.c                |   16 +++++++++++++---
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |    9 +++++++++
 mm/memcontrol.c            |   28 +++++++++++++++++++++-------
 mm/page-writeback.c        |   31 ++++++++++++++++++++++++++-----
 mm/truncate.c              |    6 ++++++
 6 files changed, 76 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index ffcfb87..e7b5766 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 1;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
-		return 0;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		ret = 0;
+		goto out;
+	}
 
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
@@ -629,7 +637,9 @@ int __set_page_dirty(struct page *page,
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	return 1;
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c1e2617..8c6b8ca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..7f53fb0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,10 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *      ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->move_lock_mem_cgroup)
+ *        ->mapping->tree_lock
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -112,6 +116,8 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool locked;
+	unsigned long flags;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +145,13 @@ void __delete_from_page_cache(struct page *page)
 	 * Fix it up by doing a final dirty accounting check after
 	 * having removed the page entirely.
 	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aef9fb0..cdcd547 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,6 +85,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -2541,6 +2542,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -2586,13 +2599,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 93d8d2f..233e7ac 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1932,11 +1932,17 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1976,12 +1982,19 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
+		ret = 1;
 		if (!mapping)
-			return 1;
+			goto out;
 
 		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
@@ -1997,9 +2010,11 @@ int __set_page_dirty_nobuffers(struct page *page)
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
-		return 1;
 	}
-	return 0;
+
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
@@ -2114,6 +2129,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2155,13 +2173,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 75801ac..052016a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.1




* [PATCH V2 5/6] memcg: add per cgroup writeback pages accounting
From: Sha Zhengju @ 2012-07-27 10:28 UTC
  To: linux-mm, cgroups
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty pages, we add per-cgroup writeback page accounting. The
locking rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There're two writeback interface to modify: test_clear/set_page_writeback.
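
For reference, a call site then looks roughly like this (a sketch only;
the radix-tree and bdi bookkeeping done by the real functions is
omitted here):

        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (!TestSetPageWriteback(page))        /* not yet under writeback */
                mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
        mem_cgroup_end_update_page_stat(page, &locked, &flags);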

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   17 +++++++++++++++++
 3 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8c6b8ca..0c8a699 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdcd547..de91d3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -86,6 +86,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"mapped_file",
 	"swap",
 	"dirty",
+	"writeback",
 };
 
 enum mem_cgroup_events_index {
@@ -2607,6 +2608,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to,
 				MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 233e7ac..6b06d5e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1956,11 +1956,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting writeback pages.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2192,7 +2198,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2213,9 +2222,12 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 }
 
@@ -2223,7 +2235,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2250,6 +2265,8 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 6/6] memcg: Document cgroup dirty/writeback memory statistics
  2012-07-27 10:20 ` Sha Zhengju
  (?)
@ 2012-07-27 10:31   ` Sha Zhengju
  -1 siblings, 0 replies; 24+ messages in thread
From: Sha Zhengju @ 2012-07-27 10:31 UTC (permalink / raw)
  To: linux-mm, cgroups, kamezawa.hiroyu
  Cc: fengguang.wu, gthelen, akpm, yinghan, mhocko, linux-kernel,
	hannes, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index dd88540..f4b5778 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -420,6 +420,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty		- # of bytes of file cache that are not in sync with the disk copy.
+writeback	- # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/6] Per-cgroup page stat accounting
  2012-07-27 10:20 ` Sha Zhengju
@ 2012-07-30 12:38   ` Michal Hocko
  -1 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2012-07-30 12:38 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, fengguang.wu, gthelen, akpm, yinghan,
	linux-kernel, hannes, Sha Zhengju

On Fri 27-07-12 18:20:32, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Hi, list
> 
> This V2 patch series provides the ability for each memory cgroup to
> have independent dirty/writeback page statistics, which can provide
> information for per-cgroup direct reclaim and the like.
>
> In the first three preparatory patches, we have done some cleanup and
> reworked the vfs set page dirty routines to make "modify page info" and
> "dirty page accounting" stay in one function as much as possible, for
> the sake of the bigger memcg lock (test numbers are in the specific patch).

I think the first 3 patches should be merged (if VFS people are OK with
them of course) first to make the rest of the work easier.

> Kame, I tested these patches on linux mainline v3.5, because I cannot
> boot up the kernel under linux-next :(. But these patches are cooked
> on top of your recent memcg patches (I backported them to mainline) and
> I think there are no conflicts with the mm tree.  So if there's no other
> problem, I think it could be considered for merging.
> 
> 
> 
> Following is performance comparison between before/after the series:
> 
> Test steps (Mem-24g, ext4):
> drop_cache; sync
> cat /proc/meminfo|grep Dirty (=4KB)
> fio (buffered/randwrite/bs=4k/size=128m/filesize=1g/numjobs=8/sync) 
> cat /proc/meminfo|grep Dirty (=648696kB)

How does this test that the accounting is correct? I would expect
something that would exercise dirty vs. writeback and end up with 0 in
the end. With your 24G memcg and 1G of IO you do not see any memcg
effects (only the global ones). Have you tested with something like a
20M group, pushing 1G of data through it?

> 
> We test it for 10 times and get the average numbers:
> Before:
> write: io=1024.0MB, bw=334678 KB/s, iops=83669.2 , runt=  3136 msec
> lat (usec): min=1 , max=26203.1 , avg=81.473, stdev=275.754
> 
> After:
> write: io=1024.0MB, bw=325219 KB/s, iops= 81304.1 , runt=  3226.9 msec
> lat (usec): min=1 , max=17224 , avg=86.194, stdev=298.183
> 
> 
> 
> There is about a 2.8% performance decrease. But I notice that once memcg
> is enabled, the root_memcg exists and all allocated pages belong
> to it, so they go through the root memcg statistics routines,
> which bring some overhead.

Yes, this is exactly what I had in mind when I asked about the overhead
when memcg is not in use last time (mentioning jump labels, or whatever
they are called at the moment).
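
A possible shape for that, keeping the hot path a nop until it is
needed (a sketch; the key name and the slow-path helper are made up):

	extern struct static_key memcg_in_use_key;	/* hypothetical */

	static inline void mem_cgroup_inc_page_stat(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		/* patched to a single nop while no non-root memcg exists */
		if (!static_key_false(&memcg_in_use_key))
			return;
		__mem_cgroup_inc_page_stat(page, idx);	/* hypothetical */
	}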

> Moreover, in the case of memcg enabled && no cgroups, we can get root
> memcg stats just from the global numbers, which avoids both the accounting
> overhead and many if-test overheads. I'll work further on this later.

I think it is a must before merging. We do not want an infrastructure
that is currently unused and which brings an overhead.
Anyway, this is a good start for future work. Thanks for doing that!

> Any comments are welcomed. : )
> 
> 
> 
> Change log:
> v2 <-- v1:
> 	1. add test numbers
> 	2. some small fix and comments
> 
> Sha Zhengju (6):
> 	memcg-remove-MEMCG_NR_FILE_MAPPED.patch
> 	Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
> 	Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
> 	memcg-add-per-cgroup-dirty-pages-accounting.patch
> 	memcg-add-per-cgroup-writeback-pages-accounting.patch
> 	memcg-Document-cgroup-dirty-writeback-memory-statist.patch
> 
>  Documentation/cgroups/memory.txt |    2 +
>  fs/buffer.c                      |   36 +++++++++++++++--------
>  fs/ceph/addr.c                   |   20 +------------
>  include/linux/buffer_head.h      |    2 +
>  include/linux/memcontrol.h       |   30 ++++++++++++++-----
>  mm/filemap.c                     |    9 ++++++
>  mm/memcontrol.c                  |   58 +++++++++++++++++++-------------------
>  mm/page-writeback.c              |   48 ++++++++++++++++++++++++++++---
>  mm/rmap.c                        |    4 +-
>  mm/truncate.c                    |    6 ++++
>  10 files changed, 141 insertions(+), 74 deletions(-)
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 4/6] memcg: add per cgroup dirty pages accounting
  2012-07-27 10:28   ` Sha Zhengju
  (?)
@ 2012-07-30 14:56     ` Greg Thelen
  -1 siblings, 0 replies; 24+ messages in thread
From: Greg Thelen @ 2012-07-30 14:56 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, fengguang.wu, akpm, yinghan, mhocko,
	linux-kernel, hannes, Sha Zhengju

On Fri, Jul 27 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> can provide some info for users while the group's direct reclaim is working.
>
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we
> can use the 'struct page' flag to test page state instead of a per-page_cgroup
> flag. But memcg has a feature to move a page from one cgroup to another, which
> may race with "page stat accounting". So in order to avoid the race we have
> designed a bigger lock:
>
>          mem_cgroup_begin_update_page_stat()
>          modify page information        -->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
>
> It requires that (a) and (b) (dirty page accounting) stay close together.
>
> In the previous two preparatory patches, we have reworked the vfs set page dirty
> routines, and now the interfaces are more explicit:
>         incrementing (2):
>                 __set_page_dirty
>                 __set_page_dirty_nobuffers
>         decrementing (2):
>                 clear_page_dirty_for_io
>                 cancel_dirty_page
>
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  fs/buffer.c                |   16 +++++++++++++---
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |    9 +++++++++
>  mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>  mm/page-writeback.c        |   31 ++++++++++++++++++++++++++-----
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 76 insertions(+), 15 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ffcfb87..e7b5766 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 1;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>  
>  	spin_lock_irq(&mapping->tree_lock);

There are two problems here:
1. Here we grab the mem_cgroup_begin_update_page_stat() lock(s) and then
   the tree_lock (this can cause an AB/BA deadlock - more below).
2. Here we grab the tree_lock with spin_lock_irq(), which assumes that
   interrupts are not already disabled by spin_lock_irq[save]().  But
   mem_cgroup_begin_update_page_stat() may disable interrupts.  One
   solution would be to convert this spin_lock_irq to spin_lock_irqsave
   with a second flags local variable, as sketched below.
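
The conversion would look roughly like this (a sketch; "tree_flags" is
just an illustrative name):

	bool locked;
	unsigned long flags, tree_flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	...
	spin_lock_irqsave(&mapping->tree_lock, tree_flags);
	/* race-with-truncate check and radix tree tagging as before */
	spin_unlock_irqrestore(&mapping->tree_lock, tree_flags);
	...
	mem_cgroup_end_update_page_stat(page, &locked, &flags);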

>  	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,9 @@ int __set_page_dirty(struct page *page,
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	return 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty);
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c1e2617..8c6b8ca 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a4a5260..7f53fb0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -62,6 +62,10 @@
>   *      ->swap_lock		(exclusive_swap_page, others)
>   *        ->mapping->tree_lock
>   *
> + *    ->private_lock		(__set_page_dirty_buffers)
> + *      ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->move_lock_mem_cgroup)
> + *        ->mapping->tree_lock
> + *
>   *  ->i_mutex
>   *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
>   *
> @@ -112,6 +116,8 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> +	bool locked;
> +	unsigned long flags;
>  
>  	/*
>  	 * if we're uptodate, flush out into the cleancache, otherwise
> @@ -139,10 +145,13 @@ void __delete_from_page_cache(struct page *page)
>  	 * Fix it up by doing a final dirty accounting check after
>  	 * having removed the page entirely.
>  	 */
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);

Here the caller of __delete_from_page_cache() already holds the tree_lock,
and we grab the mem_cgroup_begin_update_page_stat() lock(s).

This can cause an AB/BA deadlock.  We need to be consistent with lock
ordering.  The filemap.c lock ordering comment indicates that the tree_lock
should be grabbed after the mem_cgroup_begin_update_page_stat() lock(s).
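
Spelled out, the inversion looks like this (simplified):

	__set_page_dirty()                  caller of __delete_from_page_cache()
	  memcg page-stat lock(s)             spin_lock_irq(&mapping->tree_lock)
	  spin_lock_irq(&tree_lock) <waits>   memcg page-stat lock(s)    <waits>

Each side ends up waiting for the lock the other one already holds.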

>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  
>  /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aef9fb0..cdcd547 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,6 +85,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2541,6 +2542,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -2586,13 +2599,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..233e7ac 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1932,11 +1932,17 @@ int __set_page_dirty_no_writeback(struct page *page)
>  
>  /*
>   * Helper function for set_page_dirty family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting dirty pages.
> + * See __set_page_dirty for example.
> + *
>   * NOTE: This relies on being atomic wrt interrupts.
>   */
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_DIRTIED);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -1976,12 +1982,19 @@ EXPORT_SYMBOL(account_page_writeback);
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
>  		struct address_space *mapping2;
>  
> +		ret = 1;
>  		if (!mapping)
> -			return 1;
> +			goto out;
>  
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
> @@ -1997,9 +2010,11 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			/* !PageAnon && !swapper_space */
>  			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  		}
> -		return 1;
>  	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> @@ -2114,6 +2129,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  int clear_page_dirty_for_io(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -2155,13 +2173,16 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * the desired exclusion. See mm/memory.c:do_wp_page()
>  		 * for more comments.
>  		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>  		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>  	}
>  	return TestClearPageDirty(page);
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>   */
>  void cancel_dirty_page(struct page *page, unsigned int account_size)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  				task_io_account_cancelled_write(account_size);
>  		}
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  EXPORT_SYMBOL(cancel_dirty_page);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 5/6] memcg: add per cgroup writeback pages accounting
  2012-07-27 10:28   ` Sha Zhengju
  (?)
@ 2012-07-30 15:02     ` Greg Thelen
  -1 siblings, 0 replies; 24+ messages in thread
From: Greg Thelen @ 2012-07-30 15:02 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, fengguang.wu, akpm, yinghan, mhocko,
	linux-kernel, hannes, Sha Zhengju

On Fri, Jul 27 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to dirty pages, we add per-cgroup writeback page accounting. The
> locking rule is still:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
>
> There are two writeback interfaces to modify: test_clear/set_page_writeback.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   17 +++++++++++++++++
>  3 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8c6b8ca..0c8a699 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,  /* # of pages under writeback */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cdcd547..de91d3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -86,6 +86,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2607,6 +2608,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to,
>  				MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 233e7ac..6b06d5e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1956,11 +1956,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold the mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting writeback pages.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2192,7 +2198,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);

Reusing a variable name (flags) in the same function can cause
confusion.  Please use a different name.  Maybe memcg_flags.

>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2213,9 +2222,12 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  }
>  
> @@ -2223,7 +2235,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2250,6 +2265,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }
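
[Editor's note: a sketch of the rename Greg asks for above, applied to
test_clear_page_writeback(); the name memcg_flags is his suggestion,
not code from the posted series:

	int test_clear_page_writeback(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		int ret;
		bool locked;
		unsigned long memcg_flags;	/* memcg stat lock irq state */

		mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
		if (mapping) {
			struct backing_dev_info *bdi = mapping->backing_dev_info;
			unsigned long flags;	/* tree_lock irq state, unchanged */
			...
		}
		...
		mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
		return ret;
	}

Only the outer variable handed to the memcg helpers is renamed; the
inner flags guarding mapping->tree_lock keeps its name, so the two
saved irq states can no longer be confused.]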

^ permalink raw reply	[flat|nested] 24+ messages in thread


end of thread, other threads:[~2012-07-30 15:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-27 10:20 [PATCH V2 0/6] Per-cgroup page stat accounting Sha Zhengju
2012-07-27 10:23 ` [PATCH V2 1/6] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2012-07-27 10:25 ` [PATCH V2 2/6] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-07-27 10:27 ` [PATCH V2 3/6] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
2012-07-27 10:28 ` [PATCH V2 4/6] memcg: add per cgroup dirty pages accounting Sha Zhengju
2012-07-30 14:56   ` Greg Thelen
2012-07-27 10:28 ` [PATCH V2 5/6] memcg: add per cgroup writeback " Sha Zhengju
2012-07-30 15:02   ` Greg Thelen
2012-07-27 10:31 ` [PATCH V2 6/6] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2012-07-30 12:38 ` [PATCH V2 0/6] Per-cgroup page stat accounting Michal Hocko
