* [PATCH 0/7] Per-cgroup page stat accounting
From: Sha Zhengju @ 2012-06-28 10:54 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

This patch series provides each memory cgroup with independent dirty/writeback page
statistics, which can inform per-cgroup direct reclaim. It also adds more detailed
dump messages for memcg OOMs.

Three features are included in this patch series:
 (0) preparation patches for page accounting
  1. memcg dirty page accounting
  2. memcg writeback page accounting
  3. memcg OOMs dump info

In the (0) preparation patches, we have reworked the vfs set-page-dirty routines so
that "modify page info" and "dirty page accounting" stay in one function as much as
possible, for the sake of the bigger memcg lock.

These patches are based on Andrew's akpm tree.

Sha Zhengju (7):
	memcg-update-cgroup-memory-document.patch
	memcg-remove-MEMCG_NR_FILE_MAPPED.patch
	Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
	Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
	memcg-add-per-cgroup-dirty-pages-accounting.patch
	memcg-add-per-cgroup-writeback-pages-accounting.patch
	memcg-print-more-detailed-info-while-memcg-oom-happe.patch	

 Documentation/cgroups/memory.txt |    2 +
 fs/buffer.c                      |   36 +++++++++-----
 fs/ceph/addr.c                   |   20 +-------
 include/linux/buffer_head.h      |    2 +
 include/linux/memcontrol.h       |   27 +++++++---
 mm/filemap.c                     |    5 ++
 mm/memcontrol.c                  |   99 +++++++++++++++++++++++--------------
 mm/page-writeback.c              |   42 ++++++++++++++--
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    6 ++
 10 files changed, 159 insertions(+), 84 deletions(-)


* [PATCH 1/7] memcg: update cgroup memory document
From: Sha Zhengju @ 2012-06-28 10:57 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Document cgroup dirty/writeback memory statistics.

The implementation of these new interfaces comes in the following patches.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index dd88540..24d7e3c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -420,6 +420,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.1


* [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
From: Sha Zhengju @ 2012-06-28 10:58 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When accounting memcg page stats, it is not worth using MEMCG_NR_FILE_MAPPED as an
extra layer of indirection, given the complexity and presumed performance overhead.
We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
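
The removed indirection, distilled from the diff below:

	/* before: a caller-side enum, translated on every update */
	mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
	/* ...mem_cgroup_update_page_stat() then switch()ed it into
	 * MEM_CGROUP_STAT_FILE_MAPPED before this_cpu_add() */

	/* after: callers index the per-cpu stat counters directly */
	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);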

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |   25 +++++++++++++++++--------
 mm/memcontrol.c            |   24 +-----------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 20 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83e7ba9..20b0f2d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,18 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -164,17 +173,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -349,12 +358,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2677e0..ebed1ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -77,20 +77,6 @@ static int really_do_swap_account __initdata = 0;
 #endif
 
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -1926,7 +1912,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1939,14 +1925,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2144160..d6b93df 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1148,7 +1148,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1202,7 +1202,7 @@ void page_remove_rmap(struct page *page)
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
-- 
1.7.1


* [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
From: Sha Zhengju @ 2012-06-28 11:01 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	torvalds, viro, linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
extracted TestSetPageDirty from __set_page_dirty, far away from account_page_dirtied.
It is better to keep the two operations in one single function for modularity. So,
to still avoid the race addressed by commit a8e7d49a, we hold private_lock until
__set_page_dirty completes. A quick look suggests there is no deadlock between
->private_lock and ->tree_lock.

This is a preparation patch for the following memcg dirty page accounting patches.
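
After this patch, __set_page_dirty_buffers() keeps the test-and-set and the
accounting under one lock; condensed from the diff below, with the buffer walk
elided:

	spin_lock(&mapping->private_lock);
	/* ... attach and walk the page's buffers ... */
	newly_dirty = __set_page_dirty(page, mapping, 1);
	/* TestSetPageDirty(), the tree_lock accounting and
	 * __mark_inode_dirty() all run while private_lock is held */
	spin_unlock(&mapping->private_lock);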

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 838a9cf..e8d96b8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	if (TestSetPageDirty(page))
+		return 0;
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	return 1;
 }
 
 /*
@@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	newly_dirty = !TestSetPageDirty(page);
+	newly_dirty = __set_page_dirty(page, mapping, 1);
 	spin_unlock(&mapping->private_lock);
 
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1115,14 +1121,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh)) {
-		struct page *page = bh->b_page;
-		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
-			if (mapping)
-				__set_page_dirty(page, mapping, 0);
-		}
-	}
+	if (!test_set_buffer_dirty(bh))
+		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
+
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
-- 
1.7.1


* [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
From: Sha Zhengju @ 2012-06-28 11:03 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	torvalds, viro, linux-fsdevel, sage, ceph-devel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

In the following patches we will treat SetPageDirty and dirty page accounting as one
integrated operation. Filesystems should use the vfs interface directly rather than
open-coding those details.
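
For a filesystem, the conversion looks roughly like this (a hypothetical
->set_page_dirty(); the real ceph conversion is in the diff below):

	#include <linux/buffer_head.h>	/* now declares __set_page_dirty() */

	static int examplefs_set_page_dirty(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		/* the vfs helper does TestSetPageDirty, the tree_lock
		 * accounting and __mark_inode_dirty in one place, and
		 * returns 1 only if the page was newly dirtied */
		if (!__set_page_dirty(page, mapping, 1))
			return 0;

		/* ... filesystem-private bookkeeping goes here ... */
		return 1;
	}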

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c                 |    2 +-
 fs/ceph/addr.c              |   20 ++------------------
 include/linux/buffer_head.h |    2 ++
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e8d96b8..55522dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
 	if (unlikely(!mapping))
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b67304..d028fbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>	/* generic_writepages */
+#include <linux/buffer_head.h>
 #include <linux/slab.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
 	int undo = 0;
 	struct ceph_snap_context *snapc;
 
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page)) {
-		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
-		     mapping->host, page, page->index);
+	if (!__set_page_dirty(page, mapping, 1))
 		return 0;
-	}
 
 	inode = mapping->host;
 	ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
 }
 
 extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+		struct address_space *mapping, int warn);
 
 #else /* CONFIG_BLOCK */
 
-- 
1.7.1


* [PATCH 5/7] memcg: add per cgroup dirty pages accounting
From: Sha Zhengju @ 2012-06-28 11:04 UTC
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory and can
provide some information to users while a group's direct reclaim is working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we
can use 'struct page' flags to test page state instead of per-page_cgroup flags. But
memcg can move a page from one cgroup to another, and there may be a race between
"move" and "page stat accounting". So, to avoid the race, we have designed a bigger
lock:

         mem_cgroup_begin_update_page_stat()
         modify page information	-->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires that (a) and (b) (dirty page accounting) stay close together.
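
The "move" side that this lock serializes against looks like this, condensed from
the mem_cgroup_move_account() hunk below (the unlock and css transfer that follow
are elided):

	move_lock_mem_cgroup(from, &flags);
	if (PageDirty(page))
		mem_cgroup_move_account_page_stat(from, to,
				MEM_CGROUP_STAT_FILE_DIRTY);
	mem_cgroup_charge_statistics(from, anon, -nr_pages);
	/* ... unlock and the rest of the move ... */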

In the previous two preparation patches, we reworked the vfs set-page-dirty routines,
and the accounting sites are now explicit:
	incrementing (2):
		__set_page_dirty
		__set_page_dirty_nobuffers
	decrementing (2):
		clear_page_dirty_for_io
		cancel_dirty_page


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c                |   17 ++++++++++++++---
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |    5 +++++
 mm/memcontrol.c            |   28 +++++++++++++++++++++-------
 mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
 mm/truncate.c              |    6 ++++++
 6 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 55522dd..d3714cc 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
-		return 0;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		ret = 0;
+		goto out;
+	}
 
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
@@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	return 1;
+	ret = 1;
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20b0f2d..ad37b59 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f19ec3..5159a49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		/*
+		 * Do not change page state, so no need to use mem_cgroup_
+		 * {begin, end}_update_page_stat to get lock.
+		 */
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebed1ca..90e2946 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e5363f3..e79a2f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
-		if (!mapping)
-			return 1;
+		if (!mapping) {
+			ret = 1;
+			goto out;
+		}
 
 		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
@@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
-		return 1;
+		ret = 1;
 	}
-	return 0;
+
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
@@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 75801ac..052016a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH 5/7] memcg: add per cgroup dirty pages accounting
@ 2012-06-28 11:04   ` Sha Zhengju
  0 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:04 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows memory controller
to maintain an accurate view of the amount of its dirty memory and can provide some
info for users while group's direct reclaim is working.

After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
has a feature to move a page from a cgroup to another one and may have race between
"move" and "page stat accounting". So in order to avoid the race we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information	-->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires (a) and (b)(dirty pages accounting) can stay close enough.

In the previous two prepare patches, we have reworked the vfs set page dirty routines
and now the interfaces are more explicit:
	incrementing (2):
		__set_page_dirty
		__set_page_dirty_nobuffers
	decrementing (2):
		clear_page_dirty_for_io
		cancel_dirty_page


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c                |   17 ++++++++++++++---
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |    5 +++++
 mm/memcontrol.c            |   28 +++++++++++++++++++++-------
 mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
 mm/truncate.c              |    6 ++++++
 6 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 55522dd..d3714cc 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
-		return 0;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		ret = 0;
+		goto out;
+	}
 
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
@@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	return 1;
+	ret = 1;
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20b0f2d..ad37b59 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f19ec3..5159a49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		/*
+		 * Do not change page state, so no need to use mem_cgroup_
+		 * {begin, end}_update_page_stat to get lock.
+		 */
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebed1ca..90e2946 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e5363f3..e79a2f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
-		if (!mapping)
-			return 1;
+		if (!mapping) {
+			ret = 1;
+			goto out;
+		}
 
 		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
@@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
-		return 1;
+		ret = 1;
 	}
-	return 0;
+
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	return ret;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
@@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index 75801ac..052016a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-06-28 10:54 ` Sha Zhengju
@ 2012-06-28 11:05   ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:05 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty page accounting, we add per-cgroup writeback page accounting.
The locking rule is still:
	mem_cgroup_begin_update_page_stat()
	modify page WRITEBACK stat
	mem_cgroup_update_page_stat()
	mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_clear_page_writeback and
test_set_page_writeback, as sketched below.
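
As a rough sketch, the bracket applied to the writeback bit looks like the
following (illustrative only, not part of the patch; the real changes live
in the test_set_page_writeback/test_clear_page_writeback hunks below):

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	/* modify the page WRITEBACK state and account it under one bracket */
	if (!TestSetPageWriteback(page))
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);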

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   12 ++++++++++++
 3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad37b59..9193d93 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,  /* # of pages under writeback */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90e2946..8493119 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"mapped_file",
 	"swap",
 	"dirty",
+	"writeback",
 };
 
 enum mem_cgroup_events_index {
@@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to,
 				MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to,
+				MEM_CGROUP_STAT_FILE_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e79a2f7..7398836 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 }
 
@@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH 7/7] memcg: print more detailed info while memcg oom happening
  2012-06-28 10:54 ` Sha Zhengju
@ 2012-06-28 11:06   ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:06 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When a memcg OOM happens, the dumped info is limited, so add this
to provide the memcg page statistics.
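
With this patch applied, the OOM report should gain a block roughly like the
following (the field names come from mem_cgroup_stat_names,
mem_cgroup_events_names and mem_cgroup_lru_names in the patch below; the
numbers are made up purely for illustration, and the swap field is skipped
when swap accounting is disabled):

	Memory cgroup stat:
	cache:5120KB rss:10240KB mapped_file:1024KB swap:0KB dirty:256KB writeback:128KB pgpgin:4130 pgpgout:1568 pgfault:8902 pgmajfault:17 inactive_anon:0KB active_anon:10240KB inactive_file:2048KB active_file:3072KB unevictable:0KB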

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++++++--------
 1 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8493119..3ed41e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
 	"pgmajfault",
 };
 
+static const char * const mem_cgroup_lru_names[] = {
+	"inactive_anon",
+	"active_anon",
+	"inactive_file",
+	"active_file",
+	"unevictable",
+};
+
 /*
  * Per memcg event counter is incremented at every pagein/pageout. With THP,
  * it will be incremated by the number of pages. This counter is used for
@@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
 	spin_unlock_irqrestore(&memcg->move_lock, *flags);
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
+static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
+{
+	int i;
+
+	printk(KERN_INFO "Memory cgroup stat:\n");
+	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+			continue;
+		printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
+			   K(mem_cgroup_read_stat(memcg, i)));
+	}
+
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+		printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
+			   mem_cgroup_read_events(memcg, i));
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
+			   K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
+	printk(KERN_CONT "\n");
+
+}
+
 /**
  * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
  * @memcg: The memory cgroup that went over limit
@@ -1422,6 +1454,8 @@ done:
 		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+
+	mem_cgroup_print_oom_stat(memcg);
 }
 
 /*
@@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
 }
 #endif /* CONFIG_NUMA */
 
-static const char * const mem_cgroup_lru_names[] = {
-	"inactive_anon",
-	"active_anon",
-	"inactive_file",
-	"active_file",
-	"unevictable",
-};
-
 static inline void mem_cgroup_lru_names_not_uptodate(void)
 {
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
@ 2012-06-29  5:21     ` Sage Weil
  0 siblings, 0 replies; 132+ messages in thread
From: Sage Weil @ 2012-06-29  5:21 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Thu, 28 Jun 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                 |    2 +-
>  fs/ceph/addr.c              |   20 ++------------------
>  include/linux/buffer_head.h |    2 ++
>  3 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
>  	if (unlikely(!mapping))

This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to 
continue to build as a module.
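
That is, a one-line fixup along these lines in fs/buffer.c, right after the
function body (a sketch of the suggestion, not a tested patch):

	EXPORT_SYMBOL(__set_page_dirty);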

With that fixed, the ceph bits are a welcome cleanup!

Acked-by: Sage Weil <sage@inktank.com>

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/writeback.h>	/* generic_writepages */
> +#include <linux/buffer_head.h>
>  #include <linux/slab.h>
>  #include <linux/pagevec.h>
>  #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>  	int undo = 0;
>  	struct ceph_snap_context *snapc;
>  
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	if (TestSetPageDirty(page)) {
> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> -		     mapping->host, page, page->index);
> +	if (!__set_page_dirty(page, mapping, 1))
>  		return 0;
> -	}
>  
>  	inode = mapping->host;
>  	ci = ceph_inode(inode);
> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>  	     snapc, snapc->seq, snapc->num_snaps);
>  	spin_unlock(&ci->i_ceph_lock);
>  
> -	/* now adjust page */
> -	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> -		WARN_ON_ONCE(!PageUptodate(page));
> -		account_page_dirtied(page, page->mapping);
> -		radix_tree_tag_set(&mapping->page_tree,
> -				page_index(page), PAGECACHE_TAG_DIRTY);
> -
>  		/*
>  		 * Reference snap context in page->private.  Also set
>  		 * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>  		undo = 1;
>  	}
>  
> -	spin_unlock_irq(&mapping->tree_lock);
> -
>  	if (undo)
>  		/* whoops, we failed to dirty the page */
>  		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>  
> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
>  	BUG_ON(!PageDirty(page));
>  	return 1;
>  }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 458f497..0a331a8 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>  }
>  
>  extern int __set_page_dirty_buffers(struct page *page);
> +extern int __set_page_dirty(struct page *page,
> +		struct address_space *mapping, int warn);
>  
>  #else /* CONFIG_BLOCK */
>  
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/7] Per-cgroup page stat accounting
  2012-06-28 10:54 ` Sha Zhengju
@ 2012-06-29  8:23   ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-06-29  8:23 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 19:54), Sha Zhengju wrote:
> This patch series provide the ability for each memory cgroup to have independent
> dirty/writeback page stats. This can provide some information for per-cgroup direct
> reclaim. Meanwhile, we add more detailed dump messages for memcg OOMs.
> 
> Three features are included in this patch series:
>   (0).prepare patches for page accounting
>    1. memcg dirty page accounting
>    2. memcg writeback page accounting
>    3. memcg OOMs dump info
> 
> In (0) prepare patches, we have reworked vfs set page dirty routines to make "modify
> page info" and "dirty page accouting" stay in one function as much as possible for
> the sake of memcg bigger lock.
> 
> These patches are cooked based on Andrew's akpm tree.
> 

Thank you! It seems good in general. I'll review in detail later.

Do you have any performance comparison from before/after the series?
I mean, set_page_dirty() is a hot path and we should be careful about adding new accounting.

Thanks,
-Kame



> Sha Zhengju (7):
> 	memcg-update-cgroup-memory-document.patch
> 	memcg-remove-MEMCG_NR_FILE_MAPPED.patch
> 	Make-TestSetPageDirty-and-dirty-page-accounting-in-o.patch
> 	Use-vfs-__set_page_dirty-interface-instead-of-doing-.patch
> 	memcg-add-per-cgroup-dirty-pages-accounting.patch
> 	memcg-add-per-cgroup-writeback-pages-accounting.patch
> 	memcg-print-more-detailed-info-while-memcg-oom-happe.patch	
> 
>   Documentation/cgroups/memory.txt |    2 +
>   fs/buffer.c                      |   36 +++++++++-----
>   fs/ceph/addr.c                   |   20 +-------
>   include/linux/buffer_head.h      |    2 +
>   include/linux/memcontrol.h       |   27 +++++++---
>   mm/filemap.c                     |    5 ++
>   mm/memcontrol.c                  |   99 +++++++++++++++++++++++--------------
>   mm/page-writeback.c              |   42 ++++++++++++++--
>   mm/rmap.c                        |    4 +-
>   mm/truncate.c                    |    6 ++
>   10 files changed, 159 insertions(+), 84 deletions(-)
> 




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/7] memcg: update cgroup memory document
  2012-06-28 10:57   ` Sha Zhengju
@ 2012-07-02  7:00     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-02  7:00 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 19:57), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Document cgroup dirty/writeback memory statistics.
> 
> The implementation for these new interface routines come in a series
> of following patches.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 0/7] Per-cgroup page stat accounting
  2012-06-29  8:23   ` Kamezawa Hiroyuki
  (?)
@ 2012-07-02  7:51     ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-02  7:51 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

On 06/29/2012 04:23 PM, Kamezawa Hiroyuki wrote:
> (2012/06/28 19:54), Sha Zhengju wrote:
>> This patch series provides the ability for each memory cgroup to have independent
>> dirty/writeback page stats. This can provide some information for per-cgroup direct
>> reclaim. Meanwhile, we add more detailed dump messages for memcg OOMs.
>>
>> Three features are included in this patch series:
>>   (0).prepare patches for page accounting
>>    1. memcg dirty page accounting
>>    2. memcg writeback page accounting
>>    3. memcg OOMs dump info
>>
>> In the (0) prepare patches, we have reworked the vfs set page dirty routines to make
>> "modify page info" and "dirty page accounting" stay in one function as much as
>> possible, for the sake of the bigger memcg lock.
>>
>> These patches are cooked on top of Andrew's akpm tree.
>>
> Thank you! It seems good in general. I'll review in detail later.
>
> Do you have any performance comparison from before/after the series?
> I mean, set_page_dirty() is a hot path, and we should be careful about adding new accounting there.


Not yet; I sent it out as soon as I worked out this solution, to check
whether the approach is okay.
I can test the series once most people agree with it.


Thanks,
Sha



* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-06-29  5:21     ` Sage Weil
@ 2012-07-02  8:10       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-02  8:10 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On 06/29/2012 01:21 PM, Sage Weil wrote:
> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju<handai.szj@taobao.com>
>>
>> In the following patches we will treat SetPageDirty and dirty page accounting as one
>> integrated operation. Filesystems had better use the vfs interface directly to avoid
>> open-coding those details.
>>
>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> ---
>>   fs/buffer.c                 |    2 +-
>>   fs/ceph/addr.c              |   20 ++------------------
>>   include/linux/buffer_head.h |    2 ++
>>   3 files changed, 5 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index e8d96b8..55522dd 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>    * If warn is true, then emit a warning if the page is not uptodate and has
>>    * not been truncated.
>>    */
>> -static int __set_page_dirty(struct page *page,
>> +int __set_page_dirty(struct page *page,
>>   		struct address_space *mapping, int warn)
>>   {
>>   	if (unlikely(!mapping))
> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> continue to build as a module.
>
> With that fixed, the ceph bits are a welcome cleanup!
>
> Acked-by: Sage Weil<sage@inktank.com>

Further, I have checked the path again; could it be reworked as follows to
avoid the undo?

__set_page_dirty();                        __set_page_dirty();
ceph operations;                           if (page->mapping)
if (page->mapping)               ==>               ceph operations;
    ;
else
    undo = 1;
if (undo)
    xxx;



Thanks,
Sha

>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 8b67304..d028fbe 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -5,6 +5,7 @@
>>   #include<linux/mm.h>
>>   #include<linux/pagemap.h>
>>   #include<linux/writeback.h>	/* generic_writepages */
>> +#include<linux/buffer_head.h>
>>   #include<linux/slab.h>
>>   #include<linux/pagevec.h>
>>   #include<linux/task_io_accounting_ops.h>
>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>>   	int undo = 0;
>>   	struct ceph_snap_context *snapc;
>>
>> -	if (unlikely(!mapping))
>> -		return !TestSetPageDirty(page);
>> -
>> -	if (TestSetPageDirty(page)) {
>> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>> -		     mapping->host, page, page->index);
>> +	if (!__set_page_dirty(page, mapping, 1))
>>   		return 0;
>> -	}
>>
>>   	inode = mapping->host;
>>   	ci = ceph_inode(inode);
>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>>   	     snapc, snapc->seq, snapc->num_snaps);
>>   	spin_unlock(&ci->i_ceph_lock);
>>
>> -	/* now adjust page */
>> -	spin_lock_irq(&mapping->tree_lock);
>>   	if (page->mapping) {	/* Race with truncate? */
>> -		WARN_ON_ONCE(!PageUptodate(page));
>> -		account_page_dirtied(page, page->mapping);
>> -		radix_tree_tag_set(&mapping->page_tree,
>> -				page_index(page), PAGECACHE_TAG_DIRTY);
>> -
>>   		/*
>>   		 * Reference snap context in page->private.  Also set
>>   		 * PagePrivate so that we get invalidatepage callback.
>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>>   		undo = 1;
>>   	}
>>
>> -	spin_unlock_irq(&mapping->tree_lock);
>> -
>>   	if (undo)
>>   		/* whoops, we failed to dirty the page */
>>   		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>
>> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> -
>>   	BUG_ON(!PageDirty(page));
>>   	return 1;
>>   }
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index 458f497..0a331a8 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>>   }
>>
>>   extern int __set_page_dirty_buffers(struct page *page);
>> +extern int __set_page_dirty(struct page *page,
>> +		struct address_space *mapping, int warn);
>>
>>   #else /* CONFIG_BLOCK */
>>
>> -- 
>> 1.7.1
>>

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-06-28 10:58   ` Sha Zhengju
  (?)
@ 2012-07-02 10:44     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-02 10:44 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 19:58), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> While accounting memcg page stat, it's not worth using MEMCG_NR_FILE_MAPPED
> as an extra layer of indirection, given the complexity and presumed
> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
@ 2012-07-02 11:14     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-02 11:14 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	torvalds, viro, linux-fsdevel, Sha Zhengju

(2012/06/28 20:01), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a (Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracted TestSetPageDirty from __set_page_dirty, leaving it far away from
> account_page_dirtied. But it's better to keep the two operations in one single
> function to stay modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> From a quick look, I believe there's no deadlock between ->private_lock and ->tree_lock.
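
A rough sketch of the reordering being described, assuming the mainline
__set_page_dirty_buffers() shape of that era (this illustrates the idea,
not the literal patch):

/*
 * Sketch: keep holding ->private_lock across the page-dirty step so that
 * buffer dirtying and TestSetPageDirty cannot be interleaved with
 * create_empty_buffers().  ->private_lock nests outside ->tree_lock,
 * which __set_page_dirty() takes internally, matching the lock order
 * assumption above.
 */
int __set_page_dirty_buffers_sketch(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	int newly_dirty;

	if (unlikely(!mapping))
		return !TestSetPageDirty(page);

	spin_lock(&mapping->private_lock);
	if (page_has_buffers(page)) {
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		do {
			set_buffer_dirty(bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}
	newly_dirty = !TestSetPageDirty(page);
	if (newly_dirty)
		__set_page_dirty(page, mapping, 1);
	spin_unlock(&mapping->private_lock);

	return newly_dirty;
}

The contention concern raised below refers to exactly this window:
->private_lock is now held across the ->tree_lock critical section.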
> 
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

I think there is no problem with the lock order.

My small concern is the impact on performance. IIUC, lock contention here can be
seen if multiple threads write to the same file in parallel.
Do you have any numbers from before/after the patch?


Thanks,
-Kame



* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-02  8:10       ` Sha Zhengju
@ 2012-07-02 14:49         ` Sage Weil
  -1 siblings, 0 replies; 132+ messages in thread
From: Sage Weil @ 2012-07-02 14:49 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Mon, 2 Jul 2012, Sha Zhengju wrote:
> On 06/29/2012 01:21 PM, Sage Weil wrote:
> > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> > 
> > > From: Sha Zhengju<handai.szj@taobao.com>
> > > 
> > > In the following patches we will treat SetPageDirty and dirty page accounting as one
> > > integrated operation. Filesystems had better use the vfs interface directly to avoid
> > > open-coding those details.
> > > 
> > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > ---
> > >   fs/buffer.c                 |    2 +-
> > >   fs/ceph/addr.c              |   20 ++------------------
> > >   include/linux/buffer_head.h |    2 ++
> > >   3 files changed, 5 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index e8d96b8..55522dd 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > >    * If warn is true, then emit a warning if the page is not uptodate and
> > > has
> > >    * not been truncated.
> > >    */
> > > -static int __set_page_dirty(struct page *page,
> > > +int __set_page_dirty(struct page *page,
> > >   		struct address_space *mapping, int warn)
> > >   {
> > >   	if (unlikely(!mapping))
> > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > continue to build as a module.
> > 
> > With that fixed, the ceph bits are a welcome cleanup!
> > 
> > Acked-by: Sage Weil<sage@inktank.com>
> 
> Further, I have checked the path again; could it be reworked as follows to
> avoid the undo?
> 
> __set_page_dirty();                        __set_page_dirty();
> ceph operations;                           if (page->mapping)
> if (page->mapping)               ==>               ceph operations;
>     ;
> else
>     undo = 1;
> if (undo)
>     xxx;

Yep.  Taking another look at the original code, though, I'm worried that 
one reason the __set_page_dirty() actions were spread out the way they are 
is that we wanted to ensure that the ceph operations were always 
performed when PagePrivate was set.

It looks like invalidatepage won't get called if private isn't set, and 
presumably it handles the truncate race with __set_page_dirty() properly 
(right?).  What about writeback?  Do we need to worry about writepage[s] 
getting called with a NULL page->private?

Thanks!
sage



> 
> 
> 
> Thanks,
> Sha
> 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 8b67304..d028fbe 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -5,6 +5,7 @@
> > >   #include<linux/mm.h>
> > >   #include<linux/pagemap.h>
> > >   #include<linux/writeback.h>	/* generic_writepages */
> > > +#include<linux/buffer_head.h>
> > >   #include<linux/slab.h>
> > >   #include<linux/pagevec.h>
> > >   #include<linux/task_io_accounting_ops.h>
> > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > >   	int undo = 0;
> > >   	struct ceph_snap_context *snapc;
> > > 
> > > -	if (unlikely(!mapping))
> > > -		return !TestSetPageDirty(page);
> > > -
> > > -	if (TestSetPageDirty(page)) {
> > > -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> > > -		     mapping->host, page, page->index);
> > > +	if (!__set_page_dirty(page, mapping, 1))
> > >   		return 0;
> > > -	}
> > > 
> > >   	inode = mapping->host;
> > >   	ci = ceph_inode(inode);
> > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > >   	     snapc, snapc->seq, snapc->num_snaps);
> > >   	spin_unlock(&ci->i_ceph_lock);
> > > 
> > > -	/* now adjust page */
> > > -	spin_lock_irq(&mapping->tree_lock);
> > >   	if (page->mapping) {	/* Race with truncate? */
> > > -		WARN_ON_ONCE(!PageUptodate(page));
> > > -		account_page_dirtied(page, page->mapping);
> > > -		radix_tree_tag_set(&mapping->page_tree,
> > > -				page_index(page), PAGECACHE_TAG_DIRTY);
> > > -
> > >   		/*
> > >   		 * Reference snap context in page->private.  Also set
> > >   		 * PagePrivate so that we get invalidatepage callback.
> > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> > >   		undo = 1;
> > >   	}
> > > 
> > > -	spin_unlock_irq(&mapping->tree_lock);
> > > -
> > >   	if (undo)
> > >   		/* whoops, we failed to dirty the page */
> > >   		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > > 
> > > -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > -
> > >   	BUG_ON(!PageDirty(page));
> > >   	return 1;
> > >   }
> > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > index 458f497..0a331a8 100644
> > > --- a/include/linux/buffer_head.h
> > > +++ b/include/linux/buffer_head.h
> > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> > >   }
> > > 
> > >   extern int __set_page_dirty_buffers(struct page *page);
> > > +extern int __set_page_dirty(struct page *page,
> > > +		struct address_space *mapping, int warn);
> > > 
> > >   #else /* CONFIG_BLOCK */
> > > 
> > > -- 
> > > 1.7.1
> > > 

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
  2012-06-28 11:04   ` Sha Zhengju
@ 2012-07-03  5:57     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-03  5:57 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 20:04), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows the memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> information for users while a group's direct reclaim is working.
> 
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
> use 'struct page' flags to test page state instead of per-page_cgroup flags. But memcg
> has a feature to move a page from one cgroup to another, so there may be a race between
> "move" and "page stat accounting". In order to avoid the race we have designed a
> bigger lock:
> 
>           mem_cgroup_begin_update_page_stat()
>           modify page information	-->(a)
>           mem_cgroup_update_page_stat()  -->(b)
>           mem_cgroup_end_update_page_stat()
> 
> It requires that (a) and (b) (the dirty page accounting) stay close together.
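
Concretely, a dirtying site under this rule ends up shaped like the
following sketch (names as used in this series):

	/* sketch of the (a)/(b) pairing described above */
	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page))	/* (a) modify page information */
		mem_cgroup_inc_page_stat(page,
					 MEM_CGROUP_STAT_FILE_DIRTY); /* (b) */
	mem_cgroup_end_update_page_stat(page, &locked, &flags);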
> 
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
> 	incrementing (2):
> 		__set_page_dirty
> 		__set_page_dirty_nobuffers
> 	decrementing (2):
> 		clear_page_dirty_for_io
> 		cancel_dirty_page
> 
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Thank you. This seems much cleaner than expected! Very good.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



> ---
>   fs/buffer.c                |   17 ++++++++++++++---
>   include/linux/memcontrol.h |    1 +
>   mm/filemap.c               |    5 +++++
>   mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>   mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>   mm/truncate.c              |    6 ++++++
>   6 files changed, 71 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   int __set_page_dirty(struct page *page,
>   		struct address_space *mapping, int warn)
>   {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
>   	if (unlikely(!mapping))
>   		return !TestSetPageDirty(page);
>   
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>   
>   	spin_lock_irq(&mapping->tree_lock);
>   	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>   	spin_unlock_irq(&mapping->tree_lock);
>   	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>   
> -	return 1;
> +	ret = 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>   }
>   
>   /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>   	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>   	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>   	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>   	MEM_CGROUP_STAT_NSTATS,
>   };
>   
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>   	 * having removed the page entirely.
>   	 */
>   	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		/*
> +		 * Do not change page state, so no need to use mem_cgroup_
> +		 * {begin, end}_update_page_stat to get lock.
> +		 */
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>   		dec_zone_page_state(page, NR_FILE_DIRTY);
>   		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>   	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>   	"rss",
>   	"mapped_file",
>   	"swap",
> +	"dirty",
>   };
>   
>   enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>   }
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>   /**
>    * mem_cgroup_move_account - move account of the page
>    * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>   
>   	move_lock_mem_cgroup(from, &flags);
>   
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>   	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>   
>   	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>   void account_page_dirtied(struct page *page, struct address_space *mapping)
>   {
>   	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>   		__inc_zone_page_state(page, NR_FILE_DIRTY);
>   		__inc_zone_page_state(page, NR_DIRTIED);
>   		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>    */
>   int __set_page_dirty_nobuffers(struct page *page)
>   {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
>   	if (!TestSetPageDirty(page)) {
>   		struct address_space *mapping = page_mapping(page);
>   		struct address_space *mapping2;
>   
> -		if (!mapping)
> -			return 1;
> +		if (!mapping) {
> +			ret = 1;
> +			goto out;
> +		}
>   
>   		spin_lock_irq(&mapping->tree_lock);
>   		mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>   			/* !PageAnon && !swapper_space */
>   			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>   		}
> -		return 1;
> +		ret = 1;
>   	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>   }
>   EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>   
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>   int clear_page_dirty_for_io(struct page *page)
>   {
>   	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>   
>   	BUG_ON(!PageLocked(page));
>   
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>   		 * the desired exclusion. See mm/memory.c:do_wp_page()
>   		 * for more comments.
>   		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>   		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>   			dec_bdi_stat(mapping->backing_dev_info,
>   					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>   		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>   	}
>   	return TestClearPageDirty(page);
>   }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>    */
>   void cancel_dirty_page(struct page *page, unsigned int account_size)
>   {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>   	if (TestClearPageDirty(page)) {
>   		struct address_space *mapping = page->mapping;
>   		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>   			dec_bdi_stat(mapping->backing_dev_info,
>   					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>   				task_io_account_cancelled_write(account_size);
>   		}
>   	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>   }
>   EXPORT_SYMBOL(cancel_dirty_page);
>   
> 





* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-06-28 11:05   ` Sha Zhengju
@ 2012-07-03  6:31     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-03  6:31 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 20:05), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty pages, we add per-cgroup writeback page accounting. The lock
> rule is still:
> 	mem_cgroup_begin_update_page_stat()
> 	modify page WRITEBACK stat
> 	mem_cgroup_update_page_stat()
> 	mem_cgroup_end_update_page_stat()
> 
> There are two writeback interfaces to modify: test_clear/set_page_writeback.
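
Following the dirty-page pattern, the set side might be shaped roughly
like this sketch (do_test_set_page_writeback() and the
MEM_CGROUP_STAT_FILE_WRITEBACK name are assumptions for illustration, not
quotes of the patch):

/*
 * Sketch only: wrap the writeback flag update in the memcg page-stat
 * lock, mirroring the dirty accounting of the previous patch.
 * do_test_set_page_writeback() is a hypothetical stand-in for the
 * existing flag/radix-tree manipulation in test_set_page_writeback().
 */
int test_set_page_writeback_sketch(struct page *page)
{
	bool locked;
	unsigned long flags;
	int was_writeback;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	was_writeback = do_test_set_page_writeback(page);
	if (!was_writeback)	/* newly under writeback: account it */
		mem_cgroup_inc_page_stat(page,
					 MEM_CGROUP_STAT_FILE_WRITEBACK);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

	return was_writeback;
}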
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Seems good to me. BTW, you named the macros MEM_CGROUP_STAT_FILE_XXX,
but I wonder whether these counters will also be used for accounting swapped-out dirty pages.

STAT_DIRTY, STAT_WRITEBACK? Do you have a better name?

Thanks,
-Kame



* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-02 14:49         ` Sage Weil
@ 2012-07-04  8:11           ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-04  8:11 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On 07/02/2012 10:49 PM, Sage Weil wrote:
> On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> On 06/29/2012 01:21 PM, Sage Weil wrote:
>>> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>>>
>>>> From: Sha Zhengju<handai.szj@taobao.com>
>>>>
>>>> In the following patches we will treat SetPageDirty and dirty page accounting as one
>>>> integrated operation. Filesystems had better use the vfs interface directly to avoid
>>>> open-coding those details.
>>>>
>>>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>>>> ---
>>>>    fs/buffer.c                 |    2 +-
>>>>    fs/ceph/addr.c              |   20 ++------------------
>>>>    include/linux/buffer_head.h |    2 ++
>>>>    3 files changed, 5 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>>> index e8d96b8..55522dd 100644
>>>> --- a/fs/buffer.c
>>>> +++ b/fs/buffer.c
>>>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>>>     * If warn is true, then emit a warning if the page is not uptodate and
>>>> has
>>>>     * not been truncated.
>>>>     */
>>>> -static int __set_page_dirty(struct page *page,
>>>> +int __set_page_dirty(struct page *page,
>>>>    		struct address_space *mapping, int warn)
>>>>    {
>>>>    	if (unlikely(!mapping))
>>> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>>> continue to build as a module.
>>>
>>> With that fixed, the ceph bits are a welcome cleanup!
>>>
>>> Acked-by: Sage Weil <sage@inktank.com>
>> Further, I checked the path again; could it be reworked as follows to
>> avoid the undo?
>>
>> Before:
>>      __set_page_dirty();
>>      ceph operations;
>>      if (page->mapping)
>>           ;
>>      else
>>           undo = 1;
>>      if (undo)
>>           xxx;
>>
>> After:
>>      __set_page_dirty();
>>      if (page->mapping)
>>           ceph operations;
> Yep.  Taking another look at the original code, though, I'm worried that
> one reason the __set_page_dirty() actions were spread out the way they are
> is because we wanted to ensure that the ceph operations were always
> performed when PagePrivate was set.
>

Sorry, I left something out:

Before:
     __set_page_dirty();
     ceph operations;
     if (page->mapping)
          SetPagePrivate;
     else
          undo = 1;
     if (undo)
          XXX;

After:
     __set_page_dirty();
     if (page->mapping) {
          SetPagePrivate;
          ceph operations;
     }

I think this can ensure that ceph operations are performed together with
SetPagePrivate.
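
Roughly, in code (a sketch only; the ceph bookkeeping is abbreviated and the
names come from the patch above):

	static int ceph_set_page_dirty_sketch(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (!__set_page_dirty(page, mapping, 1))
			return 0;	/* was already dirty */

		if (page->mapping) {	/* no race with truncate */
			SetPagePrivate(page);
			/* ceph operations: reference snapc in page->private, ... */
		}
		/* no undo path needed: the ceph side only runs when the
		 * page still has a mapping */
		return 1;
	}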

> It looks like invalidatepage won't get called if private isn't set, and
> presumably it handles the truncate race with __set_page_dirty() properly
> (right?).  What about writeback?  Do we need to worry about writepage[s]
> getting called with a NULL page->private?

__set_page_dirty does handle the race with truncate, and writeback's
writepage[s] also take page->private into consideration, which is done
inside the specific filesystems. I notice that ceph has handled this in
ceph_writepage().
Sorry, I'm not a vfs expert, so maybe I've not caught your point...


Thanks,
Sha

> Thanks!
> sage
>
>
>
>>
>>
>> Thanks,
>> Sha
>>
>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>> index 8b67304..d028fbe 100644
>>>> --- a/fs/ceph/addr.c
>>>> +++ b/fs/ceph/addr.c
>>>> @@ -5,6 +5,7 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/pagemap.h>
>>>>    #include <linux/writeback.h>	/* generic_writepages */
>>>> +#include <linux/buffer_head.h>
>>>>    #include <linux/slab.h>
>>>>    #include <linux/pagevec.h>
>>>>    #include <linux/task_io_accounting_ops.h>
>>>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    	int undo = 0;
>>>>    	struct ceph_snap_context *snapc;
>>>>
>>>> -	if (unlikely(!mapping))
>>>> -		return !TestSetPageDirty(page);
>>>> -
>>>> -	if (TestSetPageDirty(page)) {
>>>> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>>>> -		     mapping->host, page, page->index);
>>>> +	if (!__set_page_dirty(page, mapping, 1))
>>>>    		return 0;
>>>> -	}
>>>>
>>>>    	inode = mapping->host;
>>>>    	ci = ceph_inode(inode);
>>>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    	     snapc, snapc->seq, snapc->num_snaps);
>>>>    	spin_unlock(&ci->i_ceph_lock);
>>>>
>>>> -	/* now adjust page */
>>>> -	spin_lock_irq(&mapping->tree_lock);
>>>>    	if (page->mapping) {	/* Race with truncate? */
>>>> -		WARN_ON_ONCE(!PageUptodate(page));
>>>> -		account_page_dirtied(page, page->mapping);
>>>> -		radix_tree_tag_set(&mapping->page_tree,
>>>> -				page_index(page), PAGECACHE_TAG_DIRTY);
>>>> -
>>>>    		/*
>>>>    		 * Reference snap context in page->private.  Also set
>>>>    		 * PagePrivate so that we get invalidatepage callback.
>>>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    		undo = 1;
>>>>    	}
>>>>
>>>> -	spin_unlock_irq(&mapping->tree_lock);
>>>> -
>>>>    	if (undo)
>>>>    		/* whoops, we failed to dirty the page */
>>>>    		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>>>
>>>> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>>> -
>>>>    	BUG_ON(!PageDirty(page));
>>>>    	return 1;
>>>>    }
>>>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>>>> index 458f497..0a331a8 100644
>>>> --- a/include/linux/buffer_head.h
>>>> +++ b/include/linux/buffer_head.h
>>>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>>>>    }
>>>>
>>>>    extern int __set_page_dirty_buffers(struct page *page);
>>>> +extern int __set_page_dirty(struct page *page,
>>>> +		struct address_space *mapping, int warn);
>>>>
>>>>    #else /* CONFIG_BLOCK */
>>>>
>>>> -- 
>>>> 1.7.1
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
>>>> in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-03  6:31     ` Kamezawa Hiroyuki
@ 2012-07-04  8:24       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-04  8:24 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

On 07/03/2012 02:31 PM, Kamezawa Hiroyuki wrote:
> (2012/06/28 20:05), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Similar to the dirty page accounting, we add per-cgroup writeback page
>> accounting. The lock rule is still:
>> 	mem_cgroup_begin_update_page_stat()
>> 	modify page WRITEBACK stat
>> 	mem_cgroup_update_page_stat()
>> 	mem_cgroup_end_update_page_stat()
>>
>> There are two writeback interfaces to modify: test_clear/set_page_writeback.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Seems good to me. BTW, you named the macros MEM_CGROUP_STAT_FILE_XXX,
> but I wonder whether these counters will also be used for accounting
> swap-out's dirty pages.
>
> STAT_DIRTY, STAT_WRITEBACK? Do you have a better name?

Okay, STAT_DIRTY/WRITEBACK seem good; I'll change them in the next version.


Thanks,
Sha
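
For concreteness, one possible shape of the renamed indexes, extending the
enum quoted later in this thread (hypothetical; the actual next version of
the series may differ):

	enum mem_cgroup_stat_index {
		MEM_CGROUP_STAT_CACHE,	     /* # of pages charged as cache */
		MEM_CGROUP_STAT_RSS,	     /* # of pages charged as anon rss */
		MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
		MEM_CGROUP_STAT_SWAP,	     /* # of pages, swapped out */
		MEM_CGROUP_STAT_DIRTY,	     /* # of dirty pages in page cache */
		MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
		MEM_CGROUP_STAT_NSTATS,
	};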

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 7/7] memcg: print more detailed info while memcg oom happening
  2012-06-28 11:06   ` Sha Zhengju
@ 2012-07-04  8:25     ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-04  8:25 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

Hi, Kame

How about this bit? :-)

On 06/28/2012 07:06 PM, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> While a memcg OOM is happening, the dump info is limited, so add this
> to provide memcg page stats.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>   mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++++++--------
>   1 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8493119..3ed41e9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
>   	"pgmajfault",
>   };
>
> +static const char * const mem_cgroup_lru_names[] = {
> +	"inactive_anon",
> +	"active_anon",
> +	"inactive_file",
> +	"active_file",
> +	"unevictable",
> +};
> +
>   /*
>    * Per memcg event counter is incremented at every pagein/pageout. With THP,
>    * it will be incremated by the number of pages. This counter is used for
> @@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
>   	spin_unlock_irqrestore(&memcg->move_lock, *flags);
>   }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> +	int i;
> +
> +	printk(KERN_INFO "Memory cgroup stat:\n");
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> +		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> +			continue;
> +		printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> +			   K(mem_cgroup_read_stat(memcg, i)));
> +	}
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> +			   mem_cgroup_read_events(memcg, i));
> +
> +	for (i = 0; i < NR_LRU_LISTS; i++)
> +		printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> +			   K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> +	printk(KERN_CONT "\n");
> +
> +}
> +
>   /**
>    * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>    * @memcg: The memory cgroup that went over limit
> @@ -1422,6 +1454,8 @@ done:
>   		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
>   		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
>   		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> +
> +	mem_cgroup_print_oom_stat(memcg);
>   }
>
>   /*
> @@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>   }
>   #endif /* CONFIG_NUMA */
>
> -static const char * const mem_cgroup_lru_names[] = {
> -	"inactive_anon",
> -	"active_anon",
> -	"inactive_file",
> -	"active_file",
> -	"unevictable",
> -};
> -
>   static inline void mem_cgroup_lru_names_not_uptodate(void)
>   {
>   	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
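
(For reference, the K() macro in the patch above converts a page count to
kilobytes: K(x) = x << (PAGE_SHIFT - 10). With 4 KB pages, i.e.
PAGE_SHIFT == 12, K(x) == x * 4, so 256 pages print as 1024KB.)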


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 7/7] memcg: print more detailed info while memcg oom happening
  2012-06-28 11:06   ` Sha Zhengju
@ 2012-07-04  8:29     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-04  8:29 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

(2012/06/28 20:06), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> While a memcg OOM is happening, the dump info is limited, so add this
> to provide memcg page stats.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Could you split this into a different series?
It seems good to me in general, but one concern is hierarchy handling.

IIUC, the passed 'memcg' is the root of the hierarchy that hit OOM.
So the LRU info, which is local to the root memcg, may not contain any useful
information. I think you should visit every memcg under the tree.

Thanks,
-Kame

> ---
>   mm/memcontrol.c |   42 ++++++++++++++++++++++++++++++++++--------
>   1 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8493119..3ed41e9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
>   	"pgmajfault",
>   };
>   
> +static const char * const mem_cgroup_lru_names[] = {
> +	"inactive_anon",
> +	"active_anon",
> +	"inactive_file",
> +	"active_file",
> +	"unevictable",
> +};
> +
>   /*
>    * Per memcg event counter is incremented at every pagein/pageout. With THP,
>    * it will be incremated by the number of pages. This counter is used for
> @@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
>   	spin_unlock_irqrestore(&memcg->move_lock, *flags);
>   }
>   
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> +	int i;
> +
> +	printk(KERN_INFO "Memory cgroup stat:\n");
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> +		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> +			continue;
> +		printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> +			   K(mem_cgroup_read_stat(memcg, i)));
> +	}
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> +			   mem_cgroup_read_events(memcg, i));
> +
> +	for (i = 0; i < NR_LRU_LISTS; i++)
> +		printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> +			   K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> +	printk(KERN_CONT "\n");
> +
> +}
> +
>   /**
>    * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>    * @memcg: The memory cgroup that went over limit
> @@ -1422,6 +1454,8 @@ done:
>   		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
>   		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
>   		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> +
> +	mem_cgroup_print_oom_stat(memcg);
>   }
>   
>   /*
> @@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
>   }
>   #endif /* CONFIG_NUMA */
>   
> -static const char * const mem_cgroup_lru_names[] = {
> -	"inactive_anon",
> -	"active_anon",
> -	"inactive_file",
> -	"active_file",
> -	"unevictable",
> -};
> -
>   static inline void mem_cgroup_lru_names_not_uptodate(void)
>   {
>   	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
> 
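
A minimal sketch of the suggested hierarchy walk (assuming the
mem_cgroup_iter() helper of this kernel; mem_cgroup_print_oom_stat() is the
function from the patch quoted above):

	static void mem_cgroup_print_oom_stat_hierarchy(struct mem_cgroup *root)
	{
		struct mem_cgroup *iter;

		/* visit the OOMing root and every memcg below it */
		for (iter = mem_cgroup_iter(root, NULL, NULL); iter;
		     iter = mem_cgroup_iter(root, iter, NULL))
			mem_cgroup_print_oom_stat(iter);
	}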




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 7/7] memcg: print more detailed info while memcg oom happening
  2012-07-04  8:29     ` Kamezawa Hiroyuki
@ 2012-07-04 11:20       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-04 11:20 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
	Sha Zhengju

On 07/04/2012 04:29 PM, Kamezawa Hiroyuki wrote:
> (2012/06/28 20:06), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> While a memcg OOM is happening, the dump info is limited, so add this
>> to provide memcg page stats.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Could you split this into a different series?
> It seems good to me in general, but one concern is hierarchy handling.
>
> IIUC, the passed 'memcg' is the root of the hierarchy that hit OOM.
> So the LRU info, which is local to the root memcg, may not contain any useful
> information. I think you should visit every memcg under the tree.
>
Yes, you're right!
I did not handle the hierarchy here, and I have just made a test case that
proves this.
I'll split it into another series later.

Thanks for reviewing!


Thanks,
Sha

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/7] memcg: update cgroup memory document
  2012-06-28 10:57   ` Sha Zhengju
@ 2012-07-04 12:47     ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 12:47 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, Sha Zhengju

On Thu 28-06-12 18:57:35, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Document cgroup dirty/writeback memory statistics.
> 
> The implementation for these new interface routines comes in the
> following patches.

I would expect this one to be the last...

> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  Documentation/cgroups/memory.txt |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index dd88540..24d7e3c 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -420,6 +420,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
>  pgpgout		- # of uncharging events to the memory cgroup. The uncharging
>  		event happens each time a page is unaccounted from the cgroup.
>  swap		- # of bytes of swap usage
> +dirty		- # of bytes that are waiting to get written back to the disk.
> +writeback	- # of bytes that are actively being written back to the disk.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic
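
With the series applied, the two documented counters appear in memory.stat
next to the existing ones; illustrative values only (both are reported in
bytes, per the documentation above):

	swap 0
	dirty 135168
	writeback 45056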

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-06-28 10:58   ` Sha Zhengju
@ 2012-07-04 12:56     ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 12:56 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, Sha Zhengju

On Thu 28-06-12 18:58:31, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> While accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
> as an extra layer of indirection, given the complexity and presumed
> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |   25 +++++++++++++++++--------
>  mm/memcontrol.c            |   24 +-----------------------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 20 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 83e7ba9..20b0f2d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -27,9 +27,18 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> -/* Stats that can be updated by kernel. */
> -enum mem_cgroup_page_stat_item {
> -	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_NSTATS,
>  };
>  
>  struct mem_cgroup_reclaim_cookie {
> @@ -164,17 +173,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  }
>  
>  void mem_cgroup_update_page_stat(struct page *page,
> -				 enum mem_cgroup_page_stat_item idx,
> +				 enum mem_cgroup_stat_index idx,
>  				 int val);
>  
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  	mem_cgroup_update_page_stat(page, idx, 1);
>  }
>  
>  static inline void mem_cgroup_dec_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  	mem_cgroup_update_page_stat(page, idx, -1);
>  }
> @@ -349,12 +358,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  }
>  
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  }
>  
>  static inline void mem_cgroup_dec_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a2677e0..ebed1ca 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -77,20 +77,6 @@ static int really_do_swap_account __initdata = 0;
>  #endif
>  
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  static const char * const mem_cgroup_stat_names[] = {
>  	"cache",
>  	"rss",
> @@ -1926,7 +1912,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>  }
>  
>  void mem_cgroup_update_page_stat(struct page *page,
> -				 enum mem_cgroup_page_stat_item idx, int val)
> +				 enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -1939,14 +1925,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>  		return;
>  
> -	switch (idx) {
> -	case MEMCG_NR_FILE_MAPPED:
> -		idx = MEM_CGROUP_STAT_FILE_MAPPED;
> -		break;
> -	default:
> -		BUG();
> -	}
> -
>  	this_cpu_add(memcg->stat->count[idx], val);
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2144160..d6b93df 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1148,7 +1148,7 @@ void page_add_file_rmap(struct page *page)
>  	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>  	}
>  	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
> @@ -1202,7 +1202,7 @@ void page_remove_rmap(struct page *page)
>  					      NR_ANON_TRANSPARENT_HUGEPAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-07-04 12:56     ` Michal Hocko
@ 2012-07-04 12:58       ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 12:58 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, Sha Zhengju

On Wed 04-07-12 14:56:08, Michal Hocko wrote:
> On Thu 28-06-12 18:58:31, Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > While accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
> > as an extra layer of indirection, given the complexity and presumed
> > performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
> > 
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.cz>

And I forgot to mention that this one can be merged right away.

> 
> > ---
> >  include/linux/memcontrol.h |   25 +++++++++++++++++--------
> >  mm/memcontrol.c            |   24 +-----------------------
> >  mm/rmap.c                  |    4 ++--
> >  3 files changed, 20 insertions(+), 33 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 83e7ba9..20b0f2d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -27,9 +27,18 @@ struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  
> > -/* Stats that can be updated by kernel. */
> > -enum mem_cgroup_page_stat_item {
> > -	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +	/*
> > +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +	 */
> > +	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> > +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> > +	MEM_CGROUP_STAT_NSTATS,
> >  };
> >  
> >  struct mem_cgroup_reclaim_cookie {
> > @@ -164,17 +173,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> >  }
> >  
> >  void mem_cgroup_update_page_stat(struct page *page,
> > -				 enum mem_cgroup_page_stat_item idx,
> > +				 enum mem_cgroup_stat_index idx,
> >  				 int val);
> >  
> >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> > -					    enum mem_cgroup_page_stat_item idx)
> > +					    enum mem_cgroup_stat_index idx)
> >  {
> >  	mem_cgroup_update_page_stat(page, idx, 1);
> >  }
> >  
> >  static inline void mem_cgroup_dec_page_stat(struct page *page,
> > -					    enum mem_cgroup_page_stat_item idx)
> > +					    enum mem_cgroup_stat_index idx)
> >  {
> >  	mem_cgroup_update_page_stat(page, idx, -1);
> >  }
> > @@ -349,12 +358,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> >  }
> >  
> >  static inline void mem_cgroup_inc_page_stat(struct page *page,
> > -					    enum mem_cgroup_page_stat_item idx)
> > +					    enum mem_cgroup_stat_index idx)
> >  {
> >  }
> >  
> >  static inline void mem_cgroup_dec_page_stat(struct page *page,
> > -					    enum mem_cgroup_page_stat_item idx)
> > +					    enum mem_cgroup_stat_index idx)
> >  {
> >  }
> >  
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a2677e0..ebed1ca 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -77,20 +77,6 @@ static int really_do_swap_account __initdata = 0;
> >  #endif
> >  
> >  
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -	/*
> > -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -	 */
> > -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> > -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> > -	MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  static const char * const mem_cgroup_stat_names[] = {
> >  	"cache",
> >  	"rss",
> > @@ -1926,7 +1912,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> >  }
> >  
> >  void mem_cgroup_update_page_stat(struct page *page,
> > -				 enum mem_cgroup_page_stat_item idx, int val)
> > +				 enum mem_cgroup_stat_index idx, int val)
> >  {
> >  	struct mem_cgroup *memcg;
> >  	struct page_cgroup *pc = lookup_page_cgroup(page);
> > @@ -1939,14 +1925,6 @@ void mem_cgroup_update_page_stat(struct page *page,
> >  	if (unlikely(!memcg || !PageCgroupUsed(pc)))
> >  		return;
> >  
> > -	switch (idx) {
> > -	case MEMCG_NR_FILE_MAPPED:
> > -		idx = MEM_CGROUP_STAT_FILE_MAPPED;
> > -		break;
> > -	default:
> > -		BUG();
> > -	}
> > -
> >  	this_cpu_add(memcg->stat->count[idx], val);
> >  }
> >  
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2144160..d6b93df 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1148,7 +1148,7 @@ void page_add_file_rmap(struct page *page)
> >  	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> >  	if (atomic_inc_and_test(&page->_mapcount)) {
> >  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> > -		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> > +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
> >  	}
> >  	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> >  }
> > @@ -1202,7 +1202,7 @@ void page_remove_rmap(struct page *page)
> >  					      NR_ANON_TRANSPARENT_HUGEPAGES);
> >  	} else {
> >  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > -		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> > +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
> >  	}
> >  	/*
> >  	 * It would be tidy to reset the PageAnon mapping here,
> > -- 
> > 1.7.1
> > 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
  2012-06-28 11:01   ` Sha Zhengju
@ 2012-07-04 14:23     ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 14:23 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, torvalds, viro, linux-fsdevel, Sha Zhengju,
	Nick Piggin

[Add Nick to the CC]

On Thu 28-06-12 19:01:15, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
> extracted TestSetPageDirty from __set_page_dirty, leaving it far away from
> account_page_dirtied. But it's better to keep the two operations in one
> single function to stay modular.

> So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.

So this is exactly what Nick suggested in the beginning
(https://lkml.org/lkml/2009/3/19/169)

> From a quick look, I guess there's no deadlock between ->private_lock and ->tree_lock.

Yes, and mm/filemap.c says that explicitly:
 * Lock ordering:
 *
 *  ->i_mmap_mutex              (truncate_pagecache)
 *    ->private_lock            (__free_pte->__set_page_dirty_buffers)
 *      ->swap_lock             (exclusive_swap_page, others)
 *        ->mapping->tree_lock

> 
> It's a preparation patch for the following memcg dirty page accounting patches.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

The patch seems to be correct.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  fs/buffer.c |   25 +++++++++++++------------
>  1 files changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 838a9cf..e8d96b8 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static void __set_page_dirty(struct page *page,
> +static int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	if (TestSetPageDirty(page))
> +		return 0;
> +
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
> @@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
>  	}
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	return 1;
>  }
>  
>  /*
> @@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
>  			bh = bh->b_this_page;
>  		} while (bh != head);
>  	}
> -	newly_dirty = !TestSetPageDirty(page);
> +	newly_dirty = __set_page_dirty(page, mapping, 1);
>  	spin_unlock(&mapping->private_lock);
>  
> -	if (newly_dirty)
> -		__set_page_dirty(page, mapping, 1);
>  	return newly_dirty;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
> @@ -1115,14 +1121,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  			return;
>  	}
>  
> -	if (!test_set_buffer_dirty(bh)) {
> -		struct page *page = bh->b_page;
> -		if (!TestSetPageDirty(page)) {
> -			struct address_space *mapping = page_mapping(page);
> -			if (mapping)
> -				__set_page_dirty(page, mapping, 0);
> -		}
> -	}
> +	if (!test_set_buffer_dirty(bh))
> +		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
> +
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-06-28 11:03   ` Sha Zhengju
@ 2012-07-04 14:27     ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 14:27 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, torvalds, viro, linux-fsdevel, sage, ceph-devel,
	Sha Zhengju

On Thu 28-06-12 19:03:43, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> In the following patches we will treat SetPageDirty and dirty page accounting
> as an integrated operation. Filesystems had better use the vfs interface
> directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                 |    2 +-
>  fs/ceph/addr.c              |   20 ++------------------
>  include/linux/buffer_head.h |    2 ++
>  3 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
>  	if (unlikely(!mapping))
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/writeback.h>	/* generic_writepages */
> +#include <linux/buffer_head.h>
>  #include <linux/slab.h>
>  #include <linux/pagevec.h>
>  #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>  	int undo = 0;
>  	struct ceph_snap_context *snapc;
>  
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	if (TestSetPageDirty(page)) {
> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> -		     mapping->host, page, page->index);

I am not familiar with the code, but it looks like we lose information
about something bad(?) going on here?
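
If that debug output matters, one option -- just a hedged sketch, not part of
the posted patch -- would be to keep the dout() in the ceph caller for the
already-dirty case:

	if (!__set_page_dirty(page, mapping, 1)) {
		if (mapping)	/* guard against the !mapping case */
			dout("%p set_page_dirty %p idx %lu -- already dirty\n",
			     mapping->host, page, page->index);
		return 0;
	}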

> +	if (!__set_page_dirty(page, mapping, 1))
>  		return 0;
> -	}
>  
>  	inode = mapping->host;
>  	ci = ceph_inode(inode);
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
@ 2012-07-04 16:11     ` Michal Hocko
  0 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 16:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, Sha Zhengju, Alexander Viro, linux-fsdevel

I guess you should CC the vfs people

On Thu 28-06-12 19:04:46, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> can provide some info for users while the group's direct reclaim is working.
> 
> After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"),
> we can use a 'struct page' flag to test page state instead of a per-page_cgroup
> flag. But memcg has a feature to move a page from one cgroup to another, so
> there may be a race between "move" and "page stat accounting". In order to
> avoid the race we have designed a bigger lock:
> 
>          mem_cgroup_begin_update_page_stat()
>          modify page information	-->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
> 
> It requires that (a) and (b) (the dirty page accounting) stay close enough together.
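
Concretely, every accounting site in this series ends up with this shape (a
minimal sketch assembled from the hunks below, using the dirty case):

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page)) {		/* (a) modify page information */
		mem_cgroup_inc_page_stat(page,	/* (b) dirty page accounting */
				MEM_CGROUP_STAT_FILE_DIRTY);
		/* ... zone/bdi counters, radix tree tagging ... */
	}
	mem_cgroup_end_update_page_stat(page, &locked, &flags);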
> 
> In the previous two preparation patches, we reworked the vfs set page dirty
> routines, and now the interfaces are more explicit:
> 	incrementing (2):
> 		__set_page_dirty
> 		__set_page_dirty_nobuffers
> 	decrementing (2):
> 		clear_page_dirty_for_io
> 		cancel_dirty_page

The patch seems correct at first glance (I have to look closer, but I am
in a rush at the moment and will be back next week).

I was just thinking that memcg is enabled by most distributions these
days but not that many people use it. So it would probably be good to
think about how to reduce the overhead if !mem_cgroup_disabled && no
cgroups are in use.

Glauber did something similar for the kmem accounting.
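
For illustration only -- memcg_in_use_key is a hypothetical name, and this is
roughly the jump label trick the kmem work uses -- the inline helpers could
bail out early while no memcg is actually in use:

	/* hypothetical sketch; not part of this series */
	extern struct static_key memcg_in_use_key;

	static inline void mem_cgroup_inc_page_stat(struct page *page,
						    enum mem_cgroup_stat_index idx)
	{
		if (!static_key_false(&memcg_in_use_key))
			return;	/* fast path: no memcg users, skip accounting */
		mem_cgroup_update_page_stat(page, idx, 1);
	}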

> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                |   17 ++++++++++++++---
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |    5 +++++
>  mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>  mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 71 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	return 1;
> +	ret = 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		/*
> +		 * Do not change page state, so no need to use mem_cgroup_
> +		 * {begin, end}_update_page_stat to get lock.
> +		 */
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_DIRTIED);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
>  		struct address_space *mapping2;
>  
> -		if (!mapping)
> -			return 1;
> +		if (!mapping) {
> +			ret = 1;
> +			goto out;
> +		}
>  
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			/* !PageAnon && !swapper_space */
>  			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  		}
> -		return 1;
> +		ret = 1;
>  	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  int clear_page_dirty_for_io(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * the desired exclusion. See mm/memory.c:do_wp_page()
>  		 * for more comments.
>  		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>  		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>  	}
>  	return TestClearPageDirty(page);
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>   */
>  void cancel_dirty_page(struct page *page, unsigned int account_size)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  				task_io_account_cancelled_write(account_size);
>  		}
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  EXPORT_SYMBOL(cancel_dirty_page);
>  
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-06-28 11:05   ` Sha Zhengju
@ 2012-07-04 16:15     ` Michal Hocko
  -1 siblings, 0 replies; 132+ messages in thread
From: Michal Hocko @ 2012-07-04 16:15 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, Sha Zhengju, Jan Kara, Wu Fengguang

[Let's add writeback people]

On Thu 28-06-12 19:05:25, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page accounting, we add per cgroup writeback page accounting. The lock
> rule still is:
> 	mem_cgroup_begin_update_page_stat()
> 	modify page WRITEBACK stat
> 	mem_cgroup_update_page_stat()
> 	mem_cgroup_end_update_page_stat()
> 
> There are two writeback interfaces to modify: test_clear/set_page_writeback.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   12 ++++++++++++
>  3 files changed, 18 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad37b59..9193d93 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_FILE_WRITEBACK,  /* # of pages under writeback */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 90e2946..8493119 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to,
>  				MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e79a2f7..7398836 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  }
>  
> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic
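
The mem_cgroup_move_account_page_stat() helper called in the hunk above is not shown
in this mail; a plausible shape, assuming the per-cpu counter layout memcg used at the
time (a sketch, not the actual patch):

	static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
						      struct mem_cgroup *to,
						      enum mem_cgroup_stat_index idx)
	{
		/*
		 * mem_cgroup_move_account() runs under the page's move lock,
		 * so the begin/end critical sections of the stat updaters
		 * cannot observe a half-done transfer.
		 */
		preempt_disable();
		__this_cpu_dec(from->stat->count[idx]);
		__this_cpu_inc(to->stat->count[idx]);
		preempt_enable();
	}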

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-04  8:11           ` Sha Zhengju
@ 2012-07-05 15:20             ` Sage Weil
  -1 siblings, 0 replies; 132+ messages in thread
From: Sage Weil @ 2012-07-05 15:20 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Wed, 4 Jul 2012, Sha Zhengju wrote:
> On 07/02/2012 10:49 PM, Sage Weil wrote:
> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> > > > 
> > > > > From: Sha Zhengju<handai.szj@taobao.com>
> > > > > 
> > > > > In the following we will treat SetPageDirty and dirty page accounting as an
> > > > > integrated operation. Filesystems had better use the vfs interface directly
> > > > > to avoid those details.
> > > > > 
> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > > > ---
> > > > >    fs/buffer.c                 |    2 +-
> > > > >    fs/ceph/addr.c              |   20 ++------------------
> > > > >    include/linux/buffer_head.h |    2 ++
> > > > >    3 files changed, 5 insertions(+), 19 deletions(-)
> > > > > 
> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > > index e8d96b8..55522dd 100644
> > > > > --- a/fs/buffer.c
> > > > > +++ b/fs/buffer.c
> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > > > >     * If warn is true, then emit a warning if the page is not uptodate and has
> > > > >     * not been truncated.
> > > > >     */
> > > > > -static int __set_page_dirty(struct page *page,
> > > > > +int __set_page_dirty(struct page *page,
> > > > >    		struct address_space *mapping, int warn)
> > > > >    {
> > > > >    	if (unlikely(!mapping))
> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > > > continue to build as a module.
> > > > 
> > > > With that fixed, the ceph bits are a welcome cleanup!
> > > > 
> > > > Acked-by: Sage Weil<sage@inktank.com>
> > > Further, I checked the patch again; may it be reworked as follows to avoid
> > > the undo?
> > > 
> > > __set_page_dirty();                       __set_page_dirty();
> > > ceph operations;                          if (page->mapping)
> > > if (page->mapping)            ==>                 ceph operations;
> > >      ;
> > > else
> > >      undo = 1;
> > > if (undo)
> > >      xxx;
> > Yep.  Taking another look at the original code, though, I'm worried that
> > one reason the __set_page_dirty() actions were spread out the way they are
> > is because we wanted to ensure that the ceph operations were always
> > performed when PagePrivate was set.
> > 
> 
> Sorry, I left something out:
> 
> __set_page_dirty();                    __set_page_dirty();
> ceph operations;                       if (page->mapping) {
> if (page->mapping)          ==>               SetPagePrivate;
>        SetPagePrivate;                        ceph operations;
> else                                   }
>     undo = 1;
> if (undo)
>     XXX;
> 
> I think this can ensure that the ceph operations are performed together with
> SetPagePrivate.

Yeah, that looks right, as long as the ceph accounting operations happen 
before SetPagePrivate.  I think it's no more or less racy than before, at 
least. 

The patch doesn't apply without the previous ones in the series, it looks 
like.  Do you want to prepare a new version or should I?

Thanks!
sage
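
For the archive, the shape being converged on reads roughly as below; this is a sketch
of my reading of the thread, not the final patch, and the ci/snapc setup (taking
i_ceph_lock, referencing the snap context, counting the dirty page) is elided:

	static int ceph_set_page_dirty(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (!__set_page_dirty(page, mapping, 1))
			return 0;		/* was already dirty */

		/* ... ceph accounting: take i_ceph_lock, reference snapc ... */

		if (page->mapping) {		/* no race with truncate */
			/* accounting done above, now pair it with PagePrivate */
			page->private = (unsigned long)snapc;
			SetPagePrivate(page);
		} else {
			/* truncated under us: drop the references again */
			ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
		}

		BUG_ON(!PageDirty(page));
		return 1;
	}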

 
> > It looks like invalidatepage won't get called if private isn't set, and
> > presumably it handles the truncate race with __set_page_dirty() properly
> > (right?).  What about writeback?  Do we need to worry about writepage[s]
> > getting called with a NULL page->private?
> 
> __set_page_dirty does handle racing conditions with truncate, and
> writeback writepage[s] also take page->private into consideration,
> which is done inside the specific filesystems. I notice that ceph has handled
> this in ceph_writepage().
> Sorry, I'm not a vfs expert and maybe I've not caught your point...
>
> 
> 
> Thanks,
> Sha
> 
> > Thanks!
> > sage
> > 
> > 
> > 
> > > 
> > > 
> > > Thanks,
> > > Sha
> > > 
> > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > > index 8b67304..d028fbe 100644
> > > > > --- a/fs/ceph/addr.c
> > > > > +++ b/fs/ceph/addr.c
> > > > > @@ -5,6 +5,7 @@
> > > > >    #include<linux/mm.h>
> > > > >    #include<linux/pagemap.h>
> > > > >    #include<linux/writeback.h>	/* generic_writepages */
> > > > > +#include<linux/buffer_head.h>
> > > > >    #include<linux/slab.h>
> > > > >    #include<linux/pagevec.h>
> > > > >    #include<linux/task_io_accounting_ops.h>
> > > > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > > > >    	int undo = 0;
> > > > >    	struct ceph_snap_context *snapc;
> > > > > 
> > > > > -	if (unlikely(!mapping))
> > > > > -		return !TestSetPageDirty(page);
> > > > > -
> > > > > -	if (TestSetPageDirty(page)) {
> > > > > -		dout("%p set_page_dirty %p idx %lu -- already
> > > > > dirty\n",
> > > > > -		     mapping->host, page, page->index);
> > > > > +	if (!__set_page_dirty(page, mapping, 1))
> > > > >    		return 0;
> > > > > -	}
> > > > > 
> > > > >    	inode = mapping->host;
> > > > >    	ci = ceph_inode(inode);
> > > > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > > > >    	     snapc, snapc->seq, snapc->num_snaps);
> > > > >    	spin_unlock(&ci->i_ceph_lock);
> > > > > 
> > > > > -	/* now adjust page */
> > > > > -	spin_lock_irq(&mapping->tree_lock);
> > > > >    	if (page->mapping) {	/* Race with truncate? */
> > > > > -		WARN_ON_ONCE(!PageUptodate(page));
> > > > > -		account_page_dirtied(page, page->mapping);
> > > > > -		radix_tree_tag_set(&mapping->page_tree,
> > > > > -				page_index(page), PAGECACHE_TAG_DIRTY);
> > > > > -
> > > > >    		/*
> > > > >    		 * Reference snap context in page->private.  Also set
> > > > >    		 * PagePrivate so that we get invalidatepage callback.
> > > > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> > > > >    		undo = 1;
> > > > >    	}
> > > > > 
> > > > > -	spin_unlock_irq(&mapping->tree_lock);
> > > > > -
> > > > >    	if (undo)
> > > > >    		/* whoops, we failed to dirty the page */
> > > > >    		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > > > > 
> > > > > -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > > > -
> > > > >    	BUG_ON(!PageDirty(page));
> > > > >    	return 1;
> > > > >    }
> > > > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > > > index 458f497..0a331a8 100644
> > > > > --- a/include/linux/buffer_head.h
> > > > > +++ b/include/linux/buffer_head.h
> > > > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> > > > >    }
> > > > > 
> > > > >    extern int __set_page_dirty_buffers(struct page *page);
> > > > > +extern int __set_page_dirty(struct page *page,
> > > > > +		struct address_space *mapping, int warn);
> > > > > 
> > > > >    #else /* CONFIG_BLOCK */
> > > > > 
> > > > > -- 
> > > > > 1.7.1
> > > > > 
> > > > > 
> > > > > 
> > > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-05 15:20             ` Sage Weil
@ 2012-07-05 15:40               ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-05 15:40 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Thu, Jul 5, 2012 at 11:20 PM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 4 Jul 2012, Sha Zhengju wrote:
>> On 07/02/2012 10:49 PM, Sage Weil wrote:
>> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
>> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
>> > > >
>> > > > > From: Sha Zhengju<handai.szj@taobao.com>
>> > > > >
>> > > > > In the following we will treat SetPageDirty and dirty page accounting as an
>> > > > > integrated operation. Filesystems had better use the vfs interface directly
>> > > > > to avoid those details.
>> > > > >
>> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> > > > > ---
>> > > > >    fs/buffer.c                 |    2 +-
>> > > > >    fs/ceph/addr.c              |   20 ++------------------
>> > > > >    include/linux/buffer_head.h |    2 ++
>> > > > >    3 files changed, 5 insertions(+), 19 deletions(-)
>> > > > >
>> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
>> > > > > index e8d96b8..55522dd 100644
>> > > > > --- a/fs/buffer.c
>> > > > > +++ b/fs/buffer.c
>> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>> > > > >     * If warn is true, then emit a warning if the page is not uptodate and has
>> > > > >     * not been truncated.
>> > > > >     */
>> > > > > -static int __set_page_dirty(struct page *page,
>> > > > > +int __set_page_dirty(struct page *page,
>> > > > >               struct address_space *mapping, int warn)
>> > > > >    {
>> > > > >       if (unlikely(!mapping))
>> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>> > > > continue to build as a module.
>> > > >
>> > > > With that fixed, the ceph bits are a welcome cleanup!
>> > > >
>> > > > Acked-by: Sage Weil<sage@inktank.com>
>> > > Further, I checked the patch again; may it be reworked as follows to avoid
>> > > the undo?
>> > >
>> > > __set_page_dirty();                       __set_page_dirty();
>> > > ceph operations;                          if (page->mapping)
>> > > if (page->mapping)            ==>                 ceph operations;
>> > >      ;
>> > > else
>> > >      undo = 1;
>> > > if (undo)
>> > >      xxx;
>> > Yep.  Taking another look at the original code, though, I'm worried that
>> > one reason the __set_page_dirty() actions were spread out the way they are
>> > is because we wanted to ensure that the ceph operations were always
>> > performed when PagePrivate was set.
>> >
>>
>> Sorry, I left something out:
>>
>> __set_page_dirty();                    __set_page_dirty();
>> ceph operations;                       if (page->mapping) {
>> if (page->mapping)          ==>               SetPagePrivate;
>>        SetPagePrivate;                        ceph operations;
>> else                                   }
>>     undo = 1;
>> if (undo)
>>     XXX;
>>
>> I think this can ensure that the ceph operations are performed together with
>> SetPagePrivate.
>
> Yeah, that looks right, as long as the ceph accounting operations happen
> before SetPagePrivate.  I think it's no more or less racy than before, at
> least.
>
> The patch doesn't apply without the previous ones in the series, it looks
> like.  Do you want to prepare a new version or should I?
>

Good. I'm doing some tests, then I'll send out a new version of the patchset;
please wait a bit. : )


Thanks,
Sha

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 1/7] memcg: update cgroup memory document
  2012-06-28 10:57   ` Sha Zhengju
@ 2012-07-07 13:45     ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-07 13:45 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

> +dirty		- # of bytes that are waiting to get written back to the disk.
> +writeback	- # of bytes that are actively being written back to the disk.

This should be a bit more clear to the user:

dirty - # of bytes of file cache that are not in sync with the disk copy
writeback - # of bytes of file cache that are queued for syncing to disk

Thanks,
Fengguang

>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-06-28 10:58   ` Sha Zhengju
@ 2012-07-07 13:48     ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-07 13:48 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

On Thu, Jun 28, 2012 at 06:58:31PM +0800, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> While accounting memcg page stats, it's not worthwhile to use MEMCG_NR_FILE_MAPPED
> as an extra layer of indirection because of the complexity and presumed
> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |   25 +++++++++++++++++--------
>  mm/memcontrol.c            |   24 +-----------------------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 20 insertions(+), 33 deletions(-)

Nice cleanup!

Acked-by: Fengguang Wu <fengguang.wu@intel.com>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
  2012-07-02 11:14     ` Kamezawa Hiroyuki
@ 2012-07-07 14:42       ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-07 14:42 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, torvalds, viro, linux-fsdevel, Sha Zhengju

On Mon, Jul 02, 2012 at 08:14:02PM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/06/28 20:01), Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
> > extracts TestSetPageDirty from __set_page_dirty, moving it far away from
> > account_page_dirtied. But it's better to keep the two operations in one single
> > function to stay modular. So in order to avoid the potential race mentioned in
> > commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> > I guess there's no deadlock between ->private_lock and ->tree_lock from a quick look.
> > 
> > It's a prepare patch for following memcg dirty page accounting patches.
> > 
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> 
> I think there is no problem with the lock order.

I think so, too.

> My small concern is the impact on the performance. IIUC, lock contention here can be
> seen if multiple threads write to the same file in parallel.
> Do you have any numbers before/after the patch ?

That would be a worthwhile test. The patch moves ->tree_lock and
->i_lock inside ->private_lock, and these are often contended locks.

For example, in the below case of 12 hard disks, each running 1 dd
write, the ->tree_lock and ->private_lock have the top #1 and #2
contentions.

lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.3.0/lock_stat
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

           &(&mapping->tree_lock)->rlock:      18629034       19138284           0.09        1029.32    24353812.07       49650988      482883410           0.11         186.88   260706119.09
           -----------------------------
           &(&mapping->tree_lock)->rlock            783          [<ffffffff81109267>] tag_pages_for_writeback+0x2b/0x9d
           &(&mapping->tree_lock)->rlock        3195817          [<ffffffff81100d6c>] add_to_page_cache_locked+0xa3/0x119
           &(&mapping->tree_lock)->rlock        3863710          [<ffffffff81108df7>] test_set_page_writeback+0x63/0x140
           &(&mapping->tree_lock)->rlock        3311518          [<ffffffff81172ade>] __set_page_dirty+0x25/0xa5
           -----------------------------
           &(&mapping->tree_lock)->rlock        3450725          [<ffffffff81100d6c>] add_to_page_cache_locked+0xa3/0x119
           &(&mapping->tree_lock)->rlock        3225542          [<ffffffff81172ade>] __set_page_dirty+0x25/0xa5
           &(&mapping->tree_lock)->rlock        2241958          [<ffffffff81108df7>] test_set_page_writeback+0x63/0x140
           &(&mapping->tree_lock)->rlock        7339603          [<ffffffff8110ac33>] test_clear_page_writeback+0x64/0x155

...............................................................................................................................................................................................

        &(&mapping->private_lock)->rlock:       1165199        1191201           0.11        2843.25     1621608.38       13341420      152761848           0.10        3727.92    33559035.07
        --------------------------------
        &(&mapping->private_lock)->rlock              1          [<ffffffff81172913>] __find_get_block_slow+0x5a/0x135
        &(&mapping->private_lock)->rlock         385576          [<ffffffff811735d6>] create_empty_buffers+0x48/0xbf
        &(&mapping->private_lock)->rlock         805624          [<ffffffff8117346d>] try_to_free_buffers+0x57/0xaa
        --------------------------------
        &(&mapping->private_lock)->rlock              1          [<ffffffff811746dd>] __getblk+0x1b8/0x257
        &(&mapping->private_lock)->rlock         952718          [<ffffffff8117346d>] try_to_free_buffers+0x57/0xaa
        &(&mapping->private_lock)->rlock         238482          [<ffffffff811735d6>] create_empty_buffers+0x48/0xbf

Thanks,
Fengguang
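
To make the nesting concrete, this is roughly what patch 3/7 turns fs/buffer.c's
__set_page_dirty_buffers() into, reconstructed from the quoted changelog (not the
patch itself):

	int __set_page_dirty_buffers(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		int ret;

		if (unlikely(!mapping))
			return !TestSetPageDirty(page);

		spin_lock(&mapping->private_lock);
		if (page_has_buffers(page)) {
			struct buffer_head *head = page_buffers(page);
			struct buffer_head *bh = head;

			do {
				set_buffer_dirty(bh);
				bh = bh->b_this_page;
			} while (bh != head);
		}
		/*
		 * __set_page_dirty() now does TestSetPageDirty() plus the
		 * dirty accounting, so ->tree_lock (and ->i_lock via
		 * __mark_inode_dirty()) nest inside ->private_lock --
		 * exactly the contention measured above.
		 */
		ret = __set_page_dirty(page, mapping, 1);
		spin_unlock(&mapping->private_lock);

		return ret;
	}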

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-03  6:31     ` Kamezawa Hiroyuki
@ 2012-07-08 14:44       ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-08 14:44 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On Tue, Jul 03, 2012 at 03:31:26PM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/06/28 20:05), Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > Similar to dirty page accounting, we add per cgroup writeback page accounting. The lock
> > rule still is:
> > 	mem_cgroup_begin_update_page_stat()
> > 	modify page WRITEBACK stat
> > 	mem_cgroup_update_page_stat()
> > 	mem_cgroup_end_update_page_stat()
> > 
> > There are two writeback interfaces to modify: test_clear/set_page_writeback.
> > 
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> 
> Seems good to me. BTW, you named the macros MEM_CGROUP_STAT_FILE_XXX,
> but I wonder whether these counters will also be used for accounting swap-out's dirty pages..
> 
> STAT_DIRTY, STAT_WRITEBACK? Do you have a better name?

Perhaps we can follow the established "enum zone_stat_item" names:

        NR_FILE_DIRTY,
        NR_WRITEBACK,

s/NR_/MEM_CGROUP_STAT_/

The names indicate that dirty anonymous pages are not
accounted (they go through __set_page_dirty_no_writeback()), while the
writeback accounting includes both file and anon pages.

Ah, then we'll need to update the document in patch 0 accordingly. This
may sound a bit tricky to the users.
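
Concretely, the rename would look something like this (illustrative only -- FILE_DIRTY
keeps its prefix, WRITEBACK drops it, mirroring the zone counters):

	enum mem_cgroup_stat_index {
		/* ... existing entries ... */
		MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
		MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
		MEM_CGROUP_STAT_FILE_DIRTY,	/* mirrors NR_FILE_DIRTY: file pages only */
		MEM_CGROUP_STAT_WRITEBACK,	/* mirrors NR_WRITEBACK: file + anon pages */
		MEM_CGROUP_STAT_NSTATS,
	};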

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
  2012-07-03  5:57     ` Kamezawa Hiroyuki
@ 2012-07-08 14:45       ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-08 14:45 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On Tue, Jul 03, 2012 at 02:57:08PM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/06/28 20:04), Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > This patch adds memcg routines to count dirty pages, which allows the memory controller
> > to maintain an accurate view of the amount of its dirty memory and can provide some
> > info for users while a group's direct reclaim is working.
> > 
> > After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we can
> > use 'struct page' flags to test page state instead of per-page_cgroup flags. But memcg
> > has a feature to move a page from one cgroup to another, which can race between
> > "move" and "page stat accounting". So in order to avoid the race we have designed a
> > bigger lock:
> > 
> >           mem_cgroup_begin_update_page_stat()
> >           modify page information	-->(a)
> >           mem_cgroup_update_page_stat()  -->(b)
> >           mem_cgroup_end_update_page_stat()
> > 
> > It requires that (a) and (b) (dirty page accounting) stay close enough together.
> > 
> > In the previous two prepare patches, we have reworked the vfs set page dirty routines
> > and now the interfaces are more explicit:
> > 	incrementing (2):
> > 		__set_page_dirty
> > 		__set_page_dirty_nobuffers
> > 	decrementing (2):
> > 		clear_page_dirty_for_io
> > 		cancel_dirty_page
> > 
> > 
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> 
> Thank you. This seems much cleaner than expected ! very good.
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I have the same good feelings :)

Acked-by: Fengguang Wu <fengguang.wu@intel.com>


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-06-28 11:06   ` Sha Zhengju
@ 2012-07-08 14:53     ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-08 14:53 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }

Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-08 14:44       ` Fengguang Wu
@ 2012-07-08 23:01         ` Johannes Weiner
  -1 siblings, 0 replies; 132+ messages in thread
From: Johannes Weiner @ 2012-07-08 23:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Kamezawa Hiroyuki, Sha Zhengju, linux-mm, cgroups, gthelen,
	yinghan, akpm, mhocko, linux-kernel, Sha Zhengju

On Sun, Jul 08, 2012 at 10:44:59PM +0800, Fengguang Wu wrote:
> On Tue, Jul 03, 2012 at 03:31:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > (2012/06/28 20:05), Sha Zhengju wrote:
> > > From: Sha Zhengju <handai.szj@taobao.com>
> > > 
> > > Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> > > rule still is:
> > > 	mem_cgroup_begin_update_page_stat()
> > > 	modify page WRITEBACK stat
> > > 	mem_cgroup_update_page_stat()
> > > 	mem_cgroup_end_update_page_stat()
> > > 
> > > There're two writeback interface to modify: test_clear/set_page_writeback.
> > > 
> > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> > 
> > Seems good to me. BTW, you named macros as MEM_CGROUP_STAT_FILE_XXX
> > but I wonder these counters will be used for accounting swap-out's dirty pages..
> > 
> > STAT_DIRTY, STAT_WRITEBACK ? do you have better name ?
> 
> Perhaps we can follow the established "enum zone_stat_item" names:
> 
>         NR_FILE_DIRTY,
>         NR_WRITEBACK,
> 
> s/NR_/MEM_CGROUP_STAT_/
> 
> The names indicate that dirty pages for anonymous pages are not
> accounted (by __set_page_dirty_no_writeback()). While the writeback
> pages accounting include both the file/anon pages.
> 
> Ah then we'll need to update the document in patch 0 accordingly. This
> may sound a bit tricky to the users..

We already report the global one as "nr_dirty", though.  Please don't
give the memcg one a different name.

The enum naming is not too critical, but it would be nice to have it
match the public name.
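
To illustrate (a sketch only, not a concrete proposal): matching the
public name would mean the memcg stat table reuses the keys userspace
already sees in /proc/vmstat, e.g.:

	/* sketch: reuse the global counters' public names */
	static const char * const mem_cgroup_stat_names[] = {
		...
		"nr_dirty",
		"nr_writeback",
		...
	};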

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-08 23:01         ` Johannes Weiner
  (?)
@ 2012-07-09  1:37           ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-09  1:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kamezawa Hiroyuki, Sha Zhengju, linux-mm, cgroups, gthelen,
	yinghan, akpm, mhocko, linux-kernel, Sha Zhengju

On Mon, Jul 09, 2012 at 01:01:00AM +0200, Johannes Weiner wrote:
> On Sun, Jul 08, 2012 at 10:44:59PM +0800, Fengguang Wu wrote:
> > On Tue, Jul 03, 2012 at 03:31:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > > (2012/06/28 20:05), Sha Zhengju wrote:
> > > > From: Sha Zhengju <handai.szj@taobao.com>
> > > > 
> > > > Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> > > > rule still is:
> > > > 	mem_cgroup_begin_update_page_stat()
> > > > 	modify page WRITEBACK stat
> > > > 	mem_cgroup_update_page_stat()
> > > > 	mem_cgroup_end_update_page_stat()
> > > > 
> > > > There're two writeback interface to modify: test_clear/set_page_writeback.
> > > > 
> > > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> > > 
> > > Seems good to me. BTW, you named macros as MEM_CGROUP_STAT_FILE_XXX
> > > but I wonder these counters will be used for accounting swap-out's dirty pages..
> > > 
> > > STAT_DIRTY, STAT_WRITEBACK ? do you have better name ?
> > 
> > Perhaps we can follow the established "enum zone_stat_item" names:
> > 
> >         NR_FILE_DIRTY,
> >         NR_WRITEBACK,
> > 
> > s/NR_/MEM_CGROUP_STAT_/
> > 
> > The names indicate that dirty pages for anonymous pages are not
> > accounted (by __set_page_dirty_no_writeback()). While the writeback
> > pages accounting include both the file/anon pages.
> > 
> > Ah then we'll need to update the document in patch 0 accordingly. This
> > may sound a bit tricky to the users..
> 
> We already report the global one as "nr_dirty", though.  Please don't
> give the memcg one a different name.
> 
> The enum naming is not too critical, but it would be nice to have it
> match the public name.

Fair enough. The public name obviously has more weight :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-08 14:53     ` Fengguang Wu
  (?)
@ 2012-07-09  3:36       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-09  3:36 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

On 07/08/2012 10:53 PM, Fengguang Wu wrote:
>> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>>   {
>>   	struct address_space *mapping = page_mapping(page);
>>   	int ret;
>> +	bool locked;
>> +	unsigned long flags;
>>
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>   	if (mapping) {
>>   		struct backing_dev_info *bdi = mapping->backing_dev_info;
>>   		unsigned long flags;
>> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>>   	}
>>   	if (!ret)
>>   		account_page_writeback(page);
>> +
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>   	return ret;
>>
>>   }
> Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
>

It's in account_page_writeback().

  void account_page_writeback(struct page *page)
  {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
  	inc_zone_page_state(page, NR_WRITEBACK);
  }


There isn't a unified interface for inc/dec writeback accounting, so
I just followed the existing convention.
Maybe we can rework account_page_writeback() to also handle the
decrement?
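
A minimal sketch of such a rework (hypothetical, not what this series
implements; the 'inc' flag is just for illustration):

	void account_page_writeback(struct page *page, bool inc)
	{
		if (inc) {
			mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
			inc_zone_page_state(page, NR_WRITEBACK);
		} else {
			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
			dec_zone_page_state(page, NR_WRITEBACK);
		}
	}

test_clear_page_writeback() would then call it with inc == false
instead of open-coding the decrement.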


Thanks,
Sha

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  3:36       ` Sha Zhengju
  (?)
@ 2012-07-09  4:14         ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-09  4:14 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

On Mon, Jul 09, 2012 at 11:36:11AM +0800, Sha Zhengju wrote:
> On 07/08/2012 10:53 PM, Fengguang Wu wrote:
> >>@@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
> >>  {
> >>  	struct address_space *mapping = page_mapping(page);
> >>  	int ret;
> >>+	bool locked;
> >>+	unsigned long flags;
> >>
> >>+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> >>  	if (mapping) {
> >>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
> >>  		unsigned long flags;
> >>@@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
> >>  	}
> >>  	if (!ret)
> >>  		account_page_writeback(page);
> >>+
> >>+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> >>  	return ret;
> >>
> >>  }
> >Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
> >
> 
> It's in account_page_writeback().
> 
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }

I didn't find that chunk; perhaps it was lost during a rebase..

> There isn't a unified interface to dec/inc writeback accounting, so
> I just follow that.
> Maybe we can rework account_page_writeback() to also account
> dec in?

The current separate inc/dec paths are fine. It sounds like
over-engineering to go any further.

I'm a bit worried about some 3rd-party kernel module calling
account_page_writeback() without mem_cgroup_begin/end_update_page_stat().
Will that lead to serious locking issues, or merely inaccurate
accounting?
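
For reference, the pattern the series establishes around the accounting
call (per the hunk quoted above) is:

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	/* ... test-and-set the writeback state, computing ret ... */
	if (!ret)
		account_page_writeback(page);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

A module calling account_page_writeback() bare would skip that
protection entirely.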

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  4:14         ` Fengguang Wu
@ 2012-07-09  4:18           ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-09  4:18 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

(2012/07/09 13:14), Fengguang Wu wrote:
> On Mon, Jul 09, 2012 at 11:36:11AM +0800, Sha Zhengju wrote:
>> On 07/08/2012 10:53 PM, Fengguang Wu wrote:
>>>> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>>>>   {
>>>>   	struct address_space *mapping = page_mapping(page);
>>>>   	int ret;
>>>> +	bool locked;
>>>> +	unsigned long flags;
>>>>
>>>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>>>   	if (mapping) {
>>>>   		struct backing_dev_info *bdi = mapping->backing_dev_info;
>>>>   		unsigned long flags;
>>>> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>>>>   	}
>>>>   	if (!ret)
>>>>   		account_page_writeback(page);
>>>> +
>>>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>>>   	return ret;
>>>>
>>>>   }
>>> Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
>>>
>>
>> It's in account_page_writeback().
>>
>>   void account_page_writeback(struct page *page)
>>   {
>> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>>   	inc_zone_page_state(page, NR_WRITEBACK);
>>   }
>
> I didn't find that chunk, perhaps it's lost due to rebase..
>
>> There isn't a unified interface to dec/inc writeback accounting, so
>> I just follow that.
>> Maybe we can rework account_page_writeback() to also account
>> dec in?
>
> The current seperate inc/dec paths are fine. It sounds like
> over-engineering if going any further.
>
> I'm a bit worried about some 3rd party kernel module to call
> account_page_writeback() without mem_cgroup_begin/end_update_page_stat().
> Will that lead to serious locking issues, or merely inaccurate
> accounting?
>

Ah, hmm. Maybe it's better to add a debug check in
mem_cgroup_update_page_stat(): rcu_read_lock_held() or similar.
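
Something like the following, perhaps (just a sketch; it relies on
mem_cgroup_begin_update_page_stat() taking rcu_read_lock(), and
rcu_read_lock_held() is only meaningful under CONFIG_PROVE_RCU):

	void mem_cgroup_update_page_stat(struct page *page,
					 enum mem_cgroup_stat_index idx, int val)
	{
		/* catch callers outside begin/end_update_page_stat() */
		WARN_ON_ONCE(!rcu_read_lock_held());
		/* ... existing body unchanged ... */
	}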

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  4:14         ` Fengguang Wu
@ 2012-07-09  5:19           ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-09  5:19 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

On 07/09/2012 12:14 PM, Fengguang Wu wrote:
> On Mon, Jul 09, 2012 at 11:36:11AM +0800, Sha Zhengju wrote:
>> On 07/08/2012 10:53 PM, Fengguang Wu wrote:
>>>> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>>>>   {
>>>>   	struct address_space *mapping = page_mapping(page);
>>>>   	int ret;
>>>> +	bool locked;
>>>> +	unsigned long flags;
>>>>
>>>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>>>   	if (mapping) {
>>>>   		struct backing_dev_info *bdi = mapping->backing_dev_info;
>>>>   		unsigned long flags;
>>>> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>>>>   	}
>>>>   	if (!ret)
>>>>   		account_page_writeback(page);
>>>> +
>>>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>>>   	return ret;
>>>>
>>>>   }
>>> Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
>>>
>> It's in account_page_writeback().
>>
>>   void account_page_writeback(struct page *page)
>>   {
>> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>>   	inc_zone_page_state(page, NR_WRITEBACK);
>>   }
> I didn't find that chunk, perhaps it's lost due to rebase..

Ah? A bit weird... you can refer to the link
http://thread.gmane.org/gmane.linux.kernel.cgroups/3134
which is the complete series. Thanks!

>> There isn't a unified interface to dec/inc writeback accounting, so
>> I just follow that.
>> Maybe we can rework account_page_writeback() to also account
>> dec in?
> The current seperate inc/dec paths are fine. It sounds like
> over-engineering if going any further.
>
> I'm a bit worried about some 3rd party kernel module to call
> account_page_writeback() without mem_cgroup_begin/end_update_page_stat().
> Will that lead to serious locking issues, or merely inaccurate
> accounting?
>
> Thanks,
> Fengguang


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  4:18           ` Kamezawa Hiroyuki
  (?)
@ 2012-07-09  5:22             ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-09  5:22 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Fengguang Wu, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On 07/09/2012 12:18 PM, Kamezawa Hiroyuki wrote:
> (2012/07/09 13:14), Fengguang Wu wrote:
>> On Mon, Jul 09, 2012 at 11:36:11AM +0800, Sha Zhengju wrote:
>>> On 07/08/2012 10:53 PM, Fengguang Wu wrote:
>>>>> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>>>>>   {
>>>>>       struct address_space *mapping = page_mapping(page);
>>>>>       int ret;
>>>>> +    bool locked;
>>>>> +    unsigned long flags;
>>>>>
>>>>> +    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>>>>       if (mapping) {
>>>>>           struct backing_dev_info *bdi = mapping->backing_dev_info;
>>>>>           unsigned long flags;
>>>>> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>>>>>       }
>>>>>       if (!ret)
>>>>>           account_page_writeback(page);
>>>>> +
>>>>> +    mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>>>>       return ret;
>>>>>
>>>>>   }
>>>> Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
>>>>
>>>
>>> It's in account_page_writeback().
>>>
>>>   void account_page_writeback(struct page *page)
>>>   {
>>> +    mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>>>       inc_zone_page_state(page, NR_WRITEBACK);
>>>   }
>>
>> I didn't find that chunk, perhaps it's lost due to rebase..
>>
>>> There isn't a unified interface to dec/inc writeback accounting, so
>>> I just follow that.
>>> Maybe we can rework account_page_writeback() to also account
>>> dec in?
>>
>> The current seperate inc/dec paths are fine. It sounds like
>> over-engineering if going any further.
>>
>> I'm a bit worried about some 3rd party kernel module to call
>> account_page_writeback() without 
>> mem_cgroup_begin/end_update_page_stat().
>> Will that lead to serious locking issues, or merely inaccurate
>> accounting?
>>
>
> Ah, Hm. Maybe it's better to add some debug check in
>  mem_cgroup_update_page_stat(). rcu_read_lock_held() or some.
>

This also applies to account_page_dirtied()... But as a "range" lock, I
think it's common in the current kernel: just as with set_page_dirty(),
the caller should call it under the page lock (in most cases), and it's
the caller's responsibility to guarantee correctness. I can add some
comments or a debug check as a reminder, but I think that's all I can
do...
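
For example, something along these lines above account_page_dirtied()
and account_page_writeback() (wording is only a sketch):

	/*
	 * Locking rule: callers should hold the page lock (in most
	 * cases), and the call must sit between
	 * mem_cgroup_begin_update_page_stat() and
	 * mem_cgroup_end_update_page_stat() so the memcg stat update
	 * cannot race with moving the page to another cgroup.
	 */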


Thanks,
Sha



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  5:19           ` Sha Zhengju
@ 2012-07-09  5:25             ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-09  5:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

> >>>Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
> >>>
> >>It's in account_page_writeback().
> >>
> >>  void account_page_writeback(struct page *page)
> >>  {
> >>+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
> >>  	inc_zone_page_state(page, NR_WRITEBACK);
> >>  }
> >I didn't find that chunk, perhaps it's lost due to rebase..
> 
> Ah? a bit weird... you can refer to the link
> http://thread.gmane.org/gmane.linux.kernel.cgroups/3134
> which is an integral one. Thanks!

Ah, I got it. Sorry I overlooked it... and the new view does help make
it obvious ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-07-09  5:22             ` Sha Zhengju
  (?)
@ 2012-07-09  5:28               ` Fengguang Wu
  -1 siblings, 0 replies; 132+ messages in thread
From: Fengguang Wu @ 2012-07-09  5:28 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Kamezawa Hiroyuki, linux-mm, cgroups, gthelen, yinghan, akpm,
	mhocko, linux-kernel, Sha Zhengju

On Mon, Jul 09, 2012 at 01:22:54PM +0800, Sha Zhengju wrote:
> On 07/09/2012 12:18 PM, Kamezawa Hiroyuki wrote:
> >(2012/07/09 13:14), Fengguang Wu wrote:
> >>On Mon, Jul 09, 2012 at 11:36:11AM +0800, Sha Zhengju wrote:
> >>>On 07/08/2012 10:53 PM, Fengguang Wu wrote:
> >>>>>@@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
> >>>>>  {
> >>>>>      struct address_space *mapping = page_mapping(page);
> >>>>>      int ret;
> >>>>>+    bool locked;
> >>>>>+    unsigned long flags;
> >>>>>
> >>>>>+    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> >>>>>      if (mapping) {
> >>>>>          struct backing_dev_info *bdi = mapping->backing_dev_info;
> >>>>>          unsigned long flags;
> >>>>>@@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
> >>>>>      }
> >>>>>      if (!ret)
> >>>>>          account_page_writeback(page);
> >>>>>+
> >>>>>+    mem_cgroup_end_update_page_stat(page, &locked, &flags);
> >>>>>      return ret;
> >>>>>
> >>>>>  }
> >>>>Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
> >>>>
> >>>
> >>>It's in account_page_writeback().
> >>>
> >>>  void account_page_writeback(struct page *page)
> >>>  {
> >>>+    mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
> >>>      inc_zone_page_state(page, NR_WRITEBACK);
> >>>  }
> >>
> >>I didn't find that chunk, perhaps it's lost due to rebase..
> >>
> >>>There isn't a unified interface to dec/inc writeback accounting, so
> >>>I just follow that.
> >>>Maybe we can rework account_page_writeback() to also account
> >>>dec in?
> >>
> >>The current seperate inc/dec paths are fine. It sounds like
> >>over-engineering if going any further.
> >>
> >>I'm a bit worried about some 3rd party kernel module to call
> >>account_page_writeback() without
> >>mem_cgroup_begin/end_update_page_stat().
> >>Will that lead to serious locking issues, or merely inaccurate
> >>accounting?
> >>
> >
> >Ah, Hm. Maybe it's better to add some debug check in
> > mem_cgroup_update_page_stat(). rcu_read_lock_held() or some.
> >
> 
> This also apply to account_page_dirtied()... But as an "range" lock,
> I think it's common
> in current kernel: just as set_page_dirty(), the caller should call
> it under the page lock
> (in most cases) and it's his responsibility to guarantee
> correctness. I can add some
> comments or debug check as reminding but I think i can only do so...

Yeah, it would help to add a brief comment on the locking rule in
account_page_*().

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-06-28 10:58   ` Sha Zhengju
  (?)
@ 2012-07-09 21:01     ` Greg Thelen
  -1 siblings, 0 replies; 132+ messages in thread
From: Greg Thelen @ 2012-07-09 21:01 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On Thu, Jun 28 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> While accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
> as an extra layer of indirection, given the complexity and presumed
> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |   25 +++++++++++++++++--------
>  mm/memcontrol.c            |   24 +-----------------------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 20 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 83e7ba9..20b0f2d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -27,9 +27,18 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> -/* Stats that can be updated by kernel. */
> -enum mem_cgroup_page_stat_item {
> -	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_NSTATS,
>  };

Nit.  Moving mem_cgroup_stat_index from memcontrol.c to memcontrol.h is
fine with me.  But this does increase the distance between related
defintions of definition mem_cgroup_stat_index and
mem_cgroup_stat_names.  These two lists have to be kept in sync.  So it
might help to add a comment to both indicating their relationship so we
don't accidentally modify the enum without updating the dependent string
table.
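
For example, a pair of cross-referencing comments (sketch):

	/* Keep in sync with mem_cgroup_stat_names[] in mm/memcontrol.c */
	enum mem_cgroup_stat_index {
		MEM_CGROUP_STAT_CACHE,
		MEM_CGROUP_STAT_RSS,
		MEM_CGROUP_STAT_FILE_MAPPED,
		MEM_CGROUP_STAT_SWAP,
		MEM_CGROUP_STAT_NSTATS,
	};

	/* Keep in sync with enum mem_cgroup_stat_index in memcontrol.h */
	static const char * const mem_cgroup_stat_names[] = {
		"cache",
		"rss",
		"mapped_file",
		"swap",
	};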

Otherwise, looks good.

Reviewed-by: Greg Thelen <gthelen@google.com>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
  2012-06-28 11:04   ` Sha Zhengju
@ 2012-07-09 21:02     ` Greg Thelen
  -1 siblings, 0 replies; 132+ messages in thread
From: Greg Thelen @ 2012-07-09 21:02 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On Thu, Jun 28 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> provides some info for users while a group's direct reclaim is working.
>
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
> use the 'struct page' flags to test page state instead of a per-page_cgroup flag. But
> memcg can move a page from one cgroup to another, which may race with the page stat
> accounting. So in order to avoid the race we have designed a bigger lock:
>
>          mem_cgroup_begin_update_page_stat()
>          modify page information	-->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
>
> It requires that (a) and (b) (the dirty page accounting) stay close together.
>
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
> 	incrementing (2):
> 		__set_page_dirty
> 		__set_page_dirty_nobuffers
> 	decrementing (2):
> 		clear_page_dirty_for_io
> 		cancel_dirty_page
>
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                |   17 ++++++++++++++---
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |    5 +++++
>  mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>  mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 71 insertions(+), 16 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;

The '= 0' initializer and the 'ret = 0' assignment (below) are redundant.
My vote is to remove the '= 0' here.
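
i.e.:

-	int ret = 0;
+	int ret;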

> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	return 1;
> +	ret = 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		/*
> +		 * Do not change page state, so no need to use mem_cgroup_
> +		 * {begin, end}_update_page_stat to get lock.
> +		 */
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);

I do not understand this comment.  What serializes this function and
mem_cgroup_move_account()?

>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);

It might be helpful to add a comment to account_page_dirtied()
indicating that the caller must hold the mem_cgroup_begin_update_page_stat()
lock.  Extra credit for a new assertion added to
mem_cgroup_update_page_stat() confirming the needed lock is held.
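
Assuming mem_cgroup_begin_update_page_stat() enters an RCU read-side
section (Kame suggested rcu_read_lock_held() elsewhere in this thread),
the assertion could be as simple as (sketch):

	void mem_cgroup_update_page_stat(struct page *page,
					 enum mem_cgroup_stat_index idx, int val)
	{
		/* caller must be inside begin/end_update_page_stat() */
		VM_BUG_ON(!rcu_read_lock_held());
		/* ... existing body unchanged ... */
	}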

>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_DIRTIED);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +

Is there a strict lock ordering which says that
mem_cgroup_begin_update_page_stat() must not be called while holding
tree_lock?  If yes, then maybe we should update the 'Lock ordering'
comment in mm/filemap.c to describe the
mem_cgroup_begin_update_page_stat() lock.
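
i.e. an entry roughly like this in the mm/filemap.c header comment
(a sketch, assuming the ordering is indeed strict):

 *  mem_cgroup_begin_update_page_stat	(memcg page stat accounting)
 *    ->mapping->tree_lock		(__set_page_dirty_nobuffers)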

>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
>  		struct address_space *mapping2;
>  
> -		if (!mapping)
> -			return 1;
> +		if (!mapping) {
> +			ret = 1;
> +			goto out;
> +		}


The following seems even easier because it does not need your 'ret = 1'
change below.

+		ret = 1;
		if (!mapping)
-			return 1;
+			goto out;


>  
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			/* !PageAnon && !swapper_space */
>  			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  		}
> -		return 1;
> +		ret = 1;

With the ret=1 change above, this can be changed to:
-		return 1;

>  	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  int clear_page_dirty_for_io(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * the desired exclusion. See mm/memory.c:do_wp_page()
>  		 * for more comments.
>  		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>  		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>  	}
>  	return TestClearPageDirty(page);
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>   */
>  void cancel_dirty_page(struct page *page, unsigned int account_size)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  				task_io_account_cancelled_write(account_size);
>  		}
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  EXPORT_SYMBOL(cancel_dirty_page);

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
  2012-06-28 11:06   ` Sha Zhengju
@ 2012-07-09 21:02     ` Greg Thelen
  -1 siblings, 0 replies; 132+ messages in thread
From: Greg Thelen @ 2012-07-09 21:02 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On Thu, Jun 28 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to the dirty page accounting, we add per-cgroup writeback page
> accounting. The locking rule is still:
> 	mem_cgroup_begin_update_page_stat()
> 	modify page WRITEBACK stat
> 	mem_cgroup_update_page_stat()
> 	mem_cgroup_end_update_page_stat()
>
> There are two writeback interfaces to modify: test_clear/set_page_writeback.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   12 ++++++++++++
>  3 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad37b59..9193d93 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>  	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_FILE_WRITEBACK,  /* # of pages under writeback */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 90e2946..8493119 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to,
>  				MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e79a2f7..7398836 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);

As already mentioned, I'd also like to see a comment added to
account_page_writeback() describing the new locking requirements
(specifically, mem_cgroup_begin_update_page_stat() being held by the caller).
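
e.g. (a sketch of the wording):

	/*
	 * The caller must hold mem_cgroup_begin_update_page_stat(); see
	 * test_set_page_writeback() for the expected calling pattern.
	 */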

>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  }
>  
> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED
  2012-07-09 21:01     ` Greg Thelen
@ 2012-07-11  8:00       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-11  8:00 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On 07/10/2012 05:01 AM, Greg Thelen wrote:
> On Thu, Jun 28 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> While accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
>> as an extra layer of indirection, given the complexity and presumed
>> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>   include/linux/memcontrol.h |   25 +++++++++++++++++--------
>>   mm/memcontrol.c            |   24 +-----------------------
>>   mm/rmap.c                  |    4 ++--
>>   3 files changed, 20 insertions(+), 33 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 83e7ba9..20b0f2d 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -27,9 +27,18 @@ struct page_cgroup;
>>   struct page;
>>   struct mm_struct;
>>
>> -/* Stats that can be updated by kernel. */
>> -enum mem_cgroup_page_stat_item {
>> -	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +/*
>> + * Statistics for memory cgroup.
>> + */
>> +enum mem_cgroup_stat_index {
>> +	/*
>> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
>> +	 */
>> +	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
>> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>> +	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> +	MEM_CGROUP_STAT_NSTATS,
>>   };
> Nit.  Moving mem_cgroup_stat_index from memcontrol.c to memcontrol.h is
> fine with me.  But this does increase the distance between the related
> definitions of mem_cgroup_stat_index and
> mem_cgroup_stat_names.  These two lists have to be kept in sync.  So it
> might help to add a comment to both indicating their relationship so we
> don't accidentally modify the enum without updating the dependent string
> table.
>
> Otherwise, looks good.
>
> Reviewed-by: Greg Thelen <gthelen@google.com>

Sorry for the delay.
OK, I'll add some comments there. Thanks for the reminder!

Thanks,
Sha

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
  2012-07-09 21:02     ` Greg Thelen
@ 2012-07-11  9:32       ` Sha Zhengju
  -1 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-11  9:32 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On 07/10/2012 05:02 AM, Greg Thelen wrote:
> On Thu, Jun 28 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows the memory
>> controller to maintain an accurate view of the amount of its dirty memory and
>> provides some info for users while a group's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
>> use the 'struct page' flags to test page state instead of a per-page_cgroup flag. But
>> memcg can move a page from one cgroup to another, which may race with the page stat
>> accounting. So in order to avoid the race we have designed a bigger lock:
>>
>>           mem_cgroup_begin_update_page_stat()
>>           modify page information	-->(a)
>>           mem_cgroup_update_page_stat()  -->(b)
>>           mem_cgroup_end_update_page_stat()
>>
>> It requires that (a) and (b) (the dirty page accounting) stay close together.
>>
>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>> and now the interfaces are more explicit:
>> 	incrementing (2):
>> 		__set_page_dirty
>> 		__set_page_dirty_nobuffers
>> 	decrementing (2):
>> 		clear_page_dirty_for_io
>> 		cancel_dirty_page
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>   fs/buffer.c                |   17 ++++++++++++++---
>>   include/linux/memcontrol.h |    1 +
>>   mm/filemap.c               |    5 +++++
>>   mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>>   mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>>   mm/truncate.c              |    6 ++++++
>>   6 files changed, 71 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 55522dd..d3714cc 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>   int __set_page_dirty(struct page *page,
>>   		struct address_space *mapping, int warn)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
> The '= 0' initializer and the 'ret = 0' assignment (below) are redundant.
> My vote is to remove the '= 0' here.
>

Nice catch. :-)

>> +
>>   	if (unlikely(!mapping))
>>   		return !TestSetPageDirty(page);
>>
>> -	if (TestSetPageDirty(page))
>> -		return 0;
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
>> +	if (TestSetPageDirty(page)) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>>
>>   	spin_lock_irq(&mapping->tree_lock);
>>   	if (page->mapping) {	/* Race with truncate? */
>> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>>   	spin_unlock_irq(&mapping->tree_lock);
>>   	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> -	return 1;
>> +	ret = 1;
>> +out:
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +	return ret;
>>   }
>>
>>   /*
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 20b0f2d..ad37b59 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>>   	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>>   	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>   	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>   	MEM_CGROUP_STAT_NSTATS,
>>   };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 1f19ec3..5159a49 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>>   	 * having removed the page entirely.
>>   	 */
>>   	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> +		/*
>> +		 * Do not change page state, so no need to use mem_cgroup_
>> +		 * {begin, end}_update_page_stat to get lock.
>> +		 */
>> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> I do not understand this comment.  What serializes this function and
> mem_cgroup_move_account()?
>

The race exists just because the two competitors share one public
variable, with one reading it while the other writes it. I thought
that if both sides (the accounting side and the cgroup-move side) do
not change the page flag, then risks like double-counting (see below)
would not arise.

             CPU-A                                  CPU-B
        Set PG_dirty
        (delay)                               move_lock_mem_cgroup()
                                              if (PageDirty(page))
                                                      new_memcg->nr_dirty++
                                              pc->mem_cgroup = new_memcg;
                                              move_unlock_mem_cgroup()
        move_lock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty++


But on second thought, it does have a problem without the lock:

             CPU-A                                  CPU-B
        if (PageDirty(page)) {
                                              move_lock_mem_cgroup()
                                              TestClearPageDirty(page)
                                                      memcg = pc->mem_cgroup
                                                      memcg->nr_dirty--
                                              move_unlock_mem_cgroup()

        memcg = pc->mem_cgroup
        memcg->nr_dirty--
        }

A race with the clear_page_dirty() operation may occur, so this time
I think we need the lock again...
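
If we do take the lock here, the lock-ordering question above suggests
it would have to be taken in delete_from_page_cache(), outside
tree_lock, rather than in __delete_from_page_cache() itself. Roughly
(only a sketch, using the names from this series):

	void delete_from_page_cache(struct page *page)
	{
		struct address_space *mapping = page->mapping;
		void (*freepage)(struct page *);
		bool locked;
		unsigned long flags;

		BUG_ON(!PageLocked(page));

		freepage = mapping->a_ops->freepage;

		/* the memcg page stat lock nests outside tree_lock */
		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
		spin_lock_irq(&mapping->tree_lock);
		__delete_from_page_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		mem_cgroup_end_update_page_stat(page, &locked, &flags);
		mem_cgroup_uncharge_cache_page(page);

		if (freepage)
			freepage(page);
		page_cache_release(page);
	}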

Kame, what's your opinion?

>>   		dec_zone_page_state(page, NR_FILE_DIRTY);
>>   		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>   	}
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ebed1ca..90e2946 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>   	"rss",
>>   	"mapped_file",
>>   	"swap",
>> +	"dirty",
>>   };
>>
>>   enum mem_cgroup_events_index {
>> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>   }
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +					struct mem_cgroup *to,
>> +					enum mem_cgroup_stat_index idx)
>> +{
>> +	/* Update stat data for mem_cgroup */
>> +	preempt_disable();
>> +	__this_cpu_dec(from->stat->count[idx]);
>> +	__this_cpu_inc(to->stat->count[idx]);
>> +	preempt_enable();
>> +}
>> +
>>   /**
>>    * mem_cgroup_move_account - move account of the page
>>    * @page: the page
>> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>>
>>   	move_lock_mem_cgroup(from, &flags);
>>
>> -	if (!anon && page_mapped(page)) {
>> -		/* Update mapped_file data for mem_cgroup */
>> -		preempt_disable();
>> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		preempt_enable();
>> -	}
>> +	if (!anon && page_mapped(page))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +				MEM_CGROUP_STAT_FILE_MAPPED);
>> +
>> +	if (PageDirty(page))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +				MEM_CGROUP_STAT_FILE_DIRTY);
>> +
>>   	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>>
>>   	/* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index e5363f3..e79a2f7 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>>   void account_page_dirtied(struct page *page, struct address_space *mapping)
>>   {
>>   	if (mapping_cap_account_dirty(mapping)) {
>> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> It might be helpful to add a comment to account_page_dirtied()
> indicating that the caller must hold the mem_cgroup_begin_update_page_stat()
> lock.  Extra credit for a new assertion added to
> mem_cgroup_update_page_stat() confirming the needed lock is held.
>

Got it! :-)

>>   		__inc_zone_page_state(page, NR_FILE_DIRTY);
>>   		__inc_zone_page_state(page, NR_DIRTIED);
>>   		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>>    */
>>   int __set_page_dirty_nobuffers(struct page *page)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
> Is there a strict lock ordering which says that
> mem_cgroup_begin_update_page_stat() must not be called while holding
> tree_lock?  If yes, then maybe we should update the 'Lock ordering'
> comment in mm/filemap.c to describe the
> mem_cgroup_begin_update_page_stat() lock.
>

I think yes; otherwise it may cause a deadlock. I'll update it later.

>>   	if (!TestSetPageDirty(page)) {
>>   		struct address_space *mapping = page_mapping(page);
>>   		struct address_space *mapping2;
>>
>> -		if (!mapping)
>> -			return 1;
>> +		if (!mapping) {
>> +			ret = 1;
>> +			goto out;
>> +		}
>
> The following seems even easier because it does not need your 'ret = 1'
> change below.
>
> +		ret = 1;
> 		if (!mapping)
> -			return 1;
> +			goto out;
>
>
>>
>>   		spin_lock_irq(&mapping->tree_lock);
>>   		mapping2 = page_mapping(page);
>> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>>   			/* !PageAnon && !swapper_space */
>>   			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>   		}
>> -		return 1;
>> +		ret = 1;
> With the ret=1 change above, this can be changed to:
> -		return 1;
>

Seems better.

>>   	}
>> -	return 0;
>> +
>> +out:
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +	return ret;
>>   }
>>   EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>>
>> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>   int clear_page_dirty_for_io(struct page *page)
>>   {
>>   	struct address_space *mapping = page_mapping(page);
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
>>
>>   	BUG_ON(!PageLocked(page));
>>
>> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>>   		 * the desired exclusion. See mm/memory.c:do_wp_page()
>>   		 * for more comments.
>>   		 */
>> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>   		if (TestClearPageDirty(page)) {
>> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>>   			dec_bdi_stat(mapping->backing_dev_info,
>>   					BDI_RECLAIMABLE);
>> -			return 1;
>> +			ret = 1;
>>   		}
>> -		return 0;
>> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +		return ret;
>>   	}
>>   	return TestClearPageDirty(page);
>>   }
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 75801ac..052016a 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>>    */
>>   void cancel_dirty_page(struct page *page, unsigned int account_size)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>   	if (TestClearPageDirty(page)) {
>>   		struct address_space *mapping = page->mapping;
>>   		if (mapping && mapping_cap_account_dirty(mapping)) {
>> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>>   			dec_bdi_stat(mapping->backing_dev_info,
>>   					BDI_RECLAIMABLE);
>> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>>   				task_io_account_cancelled_write(account_size);
>>   		}
>>   	}
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>   }
>>   EXPORT_SYMBOL(cancel_dirty_page);


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
@ 2012-07-11  9:32       ` Sha Zhengju
  0 siblings, 0 replies; 132+ messages in thread
From: Sha Zhengju @ 2012-07-11  9:32 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

On 07/10/2012 05:02 AM, Greg Thelen wrote:
> On Thu, Jun 28 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju<handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory and can provide some
>> info for users while group's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
>> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
>> has a feature to move a page from a cgroup to another one and may have race between
>> "move" and "page stat accounting". So in order to avoid the race we have designed a
>> bigger lock:
>>
>>           mem_cgroup_begin_update_page_stat()
>>           modify page information	-->(a)
>>           mem_cgroup_update_page_stat()  -->(b)
>>           mem_cgroup_end_update_page_stat()
>>
>> It requires (a) and (b)(dirty pages accounting) can stay close enough.
>>
>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>> and now the interfaces are more explicit:
>> 	incrementing (2):
>> 		__set_page_dirty
>> 		__set_page_dirty_nobuffers
>> 	decrementing (2):
>> 		clear_page_dirty_for_io
>> 		cancel_dirty_page
>>
>>
>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> ---
>>   fs/buffer.c                |   17 ++++++++++++++---
>>   include/linux/memcontrol.h |    1 +
>>   mm/filemap.c               |    5 +++++
>>   mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>>   mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>>   mm/truncate.c              |    6 ++++++
>>   6 files changed, 71 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 55522dd..d3714cc 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>   int __set_page_dirty(struct page *page,
>>   		struct address_space *mapping, int warn)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
> '= 0' and 'ret = 0' change (below) are redundant.  My vote is to remove
> '= 0' here.
>

Nice catch. :-)

>> +
>>   	if (unlikely(!mapping))
>>   		return !TestSetPageDirty(page);
>>
>> -	if (TestSetPageDirty(page))
>> -		return 0;
>> +	mem_cgroup_begin_update_page_stat(page,&locked,&flags);
>> +
>> +	if (TestSetPageDirty(page)) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>>
>>   	spin_lock_irq(&mapping->tree_lock);
>>   	if (page->mapping) {	/* Race with truncate? */
>> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>>   	spin_unlock_irq(&mapping->tree_lock);
>>   	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> -	return 1;
>> +	ret = 1;
>> +out:
>> +	mem_cgroup_end_update_page_stat(page,&locked,&flags);
>> +	return ret;
>>   }
>>
>>   /*
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 20b0f2d..ad37b59 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>>   	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>>   	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>   	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>   	MEM_CGROUP_STAT_NSTATS,
>>   };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 1f19ec3..5159a49 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>>   	 * having removed the page entirely.
>>   	 */
>>   	if (PageDirty(page)&&  mapping_cap_account_dirty(mapping)) {
>> +		/*
>> +		 * Do not change page state, so no need to use mem_cgroup_
>> +		 * {begin, end}_update_page_stat to get lock.
>> +		 */
>> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> I do not understand this comment.  What serializes this function and
> mem_cgroup_move_account()?
>

The race is exist just because the two competitors share one
public variable and one reads it and the other writes it.
I thought if both sides(accounting and cgroup_move) do not
change page flag, then risks like doule-counting(see below)
will not happen.

              CPU-A                                   CPU-B
         Set PG_dirty
         (delay)                                move_lock_mem_cgroup()
                                                if (PageDirty(page))
                                                        new_memcg->nr_dirty++
                                                pc->mem_cgroup = new_memcg;
                                                move_unlock_mem_cgroup()
         move_lock_mem_cgroup()
         memcg = pc->mem_cgroup
         new_memcg->nr_dirty++


But on second thoughts, it does have a problem without the lock:

              CPU-A                                   CPU-B
         if (PageDirty(page)) {
                                                move_lock_mem_cgroup()
                                                TestClearPageDirty(page)
                                                memcg = pc->mem_cgroup
                                                new_memcg->nr_dirty--
                                                move_unlock_mem_cgroup()

         memcg = pc->mem_cgroup
         new_memcg->nr_dirty--
         }


A race may occur with the clear_page_dirty() operation.
So this time I think we need the lock again...
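
Something like the following, then. One catch: __delete_from_page_cache()
runs under mapping->tree_lock, and mem_cgroup_begin_update_page_stat()
must nest outside tree_lock, so the lock would have to be taken in the
callers (a rough sketch of the idea only, not tested):

	/* e.g. in delete_from_page_cache() */
	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	spin_lock_irq(&mapping->tree_lock);
	__delete_from_page_cache(page);	/* dirty accounting happens here */
	spin_unlock_irq(&mapping->tree_lock);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);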

Kame, what about your opinion...

>>   		dec_zone_page_state(page, NR_FILE_DIRTY);
>>   		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>   	}
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ebed1ca..90e2946 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>   	"rss",
>>   	"mapped_file",
>>   	"swap",
>> +	"dirty",
>>   };
>>
>>   enum mem_cgroup_events_index {
>> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>   }
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +					struct mem_cgroup *to,
>> +					enum mem_cgroup_stat_index idx)
>> +{
>> +	/* Update stat data for mem_cgroup */
>> +	preempt_disable();
>> +	__this_cpu_dec(from->stat->count[idx]);
>> +	__this_cpu_inc(to->stat->count[idx]);
>> +	preempt_enable();
>> +}
>> +
>>   /**
>>    * mem_cgroup_move_account - move account of the page
>>    * @page: the page
>> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>>
>>   	move_lock_mem_cgroup(from,&flags);
>>
>> -	if (!anon && page_mapped(page)) {
>> -		/* Update mapped_file data for mem_cgroup */
>> -		preempt_disable();
>> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		preempt_enable();
>> -	}
>> +	if (!anon && page_mapped(page))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +				MEM_CGROUP_STAT_FILE_MAPPED);
>> +
>> +	if (PageDirty(page))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +				MEM_CGROUP_STAT_FILE_DIRTY);
>> +
>>   	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>>
>>   	/* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index e5363f3..e79a2f7 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>>   void account_page_dirtied(struct page *page, struct address_space *mapping)
>>   {
>>   	if (mapping_cap_account_dirty(mapping)) {
>> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> It might be helpful to add a comment to account_page_dirtied()
> indicating that the caller must hold the
> mem_cgroup_begin_update_page_stat() lock.  Extra credit for a new
> assertion added to mem_cgroup_update_page_stat() confirming the needed
> lock is held.
>

Got it! :-)
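
For the assertion: since mem_cgroup_begin_update_page_stat() always takes
rcu_read_lock(), a first cut could be something like this (sketch only,
it cannot verify the contended move_lock path):

	void mem_cgroup_update_page_stat(struct page *page,
					 enum mem_cgroup_stat_index idx, int val)
	{
		/*
		 * Callers must wrap us in mem_cgroup_begin/
		 * end_update_page_stat(), see account_page_dirtied().
		 */
		VM_BUG_ON(!rcu_read_lock_held());
		...
	}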

>>   		__inc_zone_page_state(page, NR_FILE_DIRTY);
>>   		__inc_zone_page_state(page, NR_DIRTIED);
>>   		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>>    */
>>   int __set_page_dirty_nobuffers(struct page *page)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
> Is there a strict lock ordering which says that
> mem_cgroup_begin_update_page_stat() must not be called while holding
> tree_lock?  If yes, then maybe we should update the 'Lock ordering'
> comment in mm/filemap.c to describe the
> mem_cgroup_begin_update_page_stat() lock.
>

I think yes; otherwise it may cause a deadlock. I'll update the comment later.
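
Maybe along these lines in the mm/filemap.c "Lock ordering" comment (just
a sketch of the placement):

 *  mem_cgroup_begin_update_page_stat	(__set_page_dirty_nobuffers)
 *    ->mapping->tree_lock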

>>   	if (!TestSetPageDirty(page)) {
>>   		struct address_space *mapping = page_mapping(page);
>>   		struct address_space *mapping2;
>>
>> -		if (!mapping)
>> -			return 1;
>> +		if (!mapping) {
>> +			ret = 1;
>> +			goto out;
>> +		}
>
> The following seems even easier because it does not need your 'ret = 1'
> change below.
>
> +		ret = 1;
> 		if (!mapping)
> -			return 1;
> +			goto out;
>
>
>>
>>   		spin_lock_irq(&mapping->tree_lock);
>>   		mapping2 = page_mapping(page);
>> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>>   			/* !PageAnon && !swapper_space */
>>   			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>   		}
>> -		return 1;
>> +		ret = 1;
> With the ret=1 change above, this can be changed to:
> -		return 1;
>

Seems better.
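
So with both changes folded in, the function would read roughly like this
(sketch, not retested):

	int __set_page_dirty_nobuffers(struct page *page)
	{
		bool locked;
		unsigned long flags;
		int ret = 0;

		mem_cgroup_begin_update_page_stat(page, &locked, &flags);

		if (!TestSetPageDirty(page)) {
			struct address_space *mapping = page_mapping(page);

			ret = 1;
			if (!mapping)
				goto out;

			/* ... mapping->tree_lock section unchanged ... */
		}
	out:
		mem_cgroup_end_update_page_stat(page, &locked, &flags);
		return ret;
	}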

>>   	}
>> -	return 0;
>> +
>> +out:
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +	return ret;
>>   }
>>   EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>>
>> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>   int clear_page_dirty_for_io(struct page *page)
>>   {
>>   	struct address_space *mapping = page_mapping(page);
>> +	bool locked;
>> +	unsigned long flags;
>> +	int ret = 0;
>>
>>   	BUG_ON(!PageLocked(page));
>>
>> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>>   		 * the desired exclusion. See mm/memory.c:do_wp_page()
>>   		 * for more comments.
>>   		 */
>> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>   		if (TestClearPageDirty(page)) {
>> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>>   			dec_bdi_stat(mapping->backing_dev_info,
>>   					BDI_RECLAIMABLE);
>> -			return 1;
>> +			ret = 1;
>>   		}
>> -		return 0;
>> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +		return ret;
>>   	}
>>   	return TestClearPageDirty(page);
>>   }
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 75801ac..052016a 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>>    */
>>   void cancel_dirty_page(struct page *page, unsigned int account_size)
>>   {
>> +	bool locked;
>> +	unsigned long flags;
>> +
>> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>   	if (TestClearPageDirty(page)) {
>>   		struct address_space *mapping = page->mapping;
>>   		if (mapping && mapping_cap_account_dirty(mapping)) {
>> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>   			dec_zone_page_state(page, NR_FILE_DIRTY);
>>   			dec_bdi_stat(mapping->backing_dev_info,
>>   					BDI_RECLAIMABLE);
>> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>>   				task_io_account_cancelled_write(account_size);
>>   		}
>>   	}
>> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>   }
>>   EXPORT_SYMBOL(cancel_dirty_page);


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
  2012-07-11  9:32       ` Sha Zhengju
@ 2012-07-19  6:33         ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 132+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-19  6:33 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Greg Thelen, linux-mm, cgroups, yinghan, akpm, mhocko,
	linux-kernel, Sha Zhengju

(2012/07/11 18:32), Sha Zhengju wrote:
> On 07/10/2012 05:02 AM, Greg Thelen wrote:
>> On Thu, Jun 28 2012, Sha Zhengju wrote:
>>
>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>
>>> This patch adds memcg routines to count dirty pages, which allows the memory
>>> controller to maintain an accurate view of the amount of its dirty memory and
>>> can provide some info for users while a group's direct reclaim is working.
>>>
>>> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
>>> we can use the 'struct page' flag to test page state instead of the
>>> per-page_cgroup flag. But memcg has a feature to move a page from one cgroup
>>> to another, so there may be a race between "move" and "page stat accounting".
>>> In order to avoid the race we have designed a bigger lock:
>>>
>>>           mem_cgroup_begin_update_page_stat()
>>>           modify page information    -->(a)
>>>           mem_cgroup_update_page_stat()  -->(b)
>>>           mem_cgroup_end_update_page_stat()
>>>
>>> It requires that (a) and (b) (dirty page accounting) stay close enough together.
>>>
>>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>>> and now the interfaces are more explicit:
>>>     incrementing (2):
>>>         __set_page_dirty
>>>         __set_page_dirty_nobuffers
>>>     decrementing (2):
>>>         clear_page_dirty_for_io
>>>         cancel_dirty_page
>>>
>>>
>>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>>> ---
>>>   fs/buffer.c                |   17 ++++++++++++++---
>>>   include/linux/memcontrol.h |    1 +
>>>   mm/filemap.c               |    5 +++++
>>>   mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>>>   mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>>>   mm/truncate.c              |    6 ++++++
>>>   6 files changed, 71 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>> index 55522dd..d3714cc 100644
>>> --- a/fs/buffer.c
>>> +++ b/fs/buffer.c
>>> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>>   int __set_page_dirty(struct page *page,
>>>           struct address_space *mapping, int warn)
>>>   {
>>> +    bool locked;
>>> +    unsigned long flags;
>>> +    int ret = 0;
>> '= 0' and 'ret = 0' change (below) are redundant.  My vote is to remove
>> '= 0' here.
>>
>
> Nice catch. :-)
>
>>> +
>>>       if (unlikely(!mapping))
>>>           return !TestSetPageDirty(page);
>>>
>>> -    if (TestSetPageDirty(page))
>>> -        return 0;
>>> +    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>> +
>>> +    if (TestSetPageDirty(page)) {
>>> +        ret = 0;
>>> +        goto out;
>>> +    }
>>>
>>>       spin_lock_irq(&mapping->tree_lock);
>>>       if (page->mapping) {    /* Race with truncate? */
>>> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>>>       spin_unlock_irq(&mapping->tree_lock);
>>>       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>>
>>> -    return 1;
>>> +    ret = 1;
>>> +out:
>>> +    mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>> +    return ret;
>>>   }
>>>
>>>   /*
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 20b0f2d..ad37b59 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>>>       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
>>>       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>>       MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>>> +    MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>>       MEM_CGROUP_STAT_NSTATS,
>>>   };
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 1f19ec3..5159a49 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>>>        * having removed the page entirely.
>>>        */
>>>       if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>>> +        /*
>>> +         * Do not change page state, so no need to use mem_cgroup_
>>> +         * {begin, end}_update_page_stat to get lock.
>>> +         */
>>> +        mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>> I do not understand this comment.  What serializes this function and
>> mem_cgroup_move_account()?
>>
>
> The race exists just because the two competitors share one
> public variable, and one reads it while the other writes it.
> I thought that if both sides (accounting and cgroup_move) do not
> change the page flag, then risks like double-counting (see below)
> would not happen.
>
>               CPU-A                                   CPU-B
>          Set PG_dirty
>          (delay)                                move_lock_mem_cgroup()
>                                                 if (PageDirty(page))
>                                                         new_memcg->nr_dirty++
>                                                 pc->mem_cgroup = new_memcg;
>                                                 move_unlock_mem_cgroup()
>          move_lock_mem_cgroup()
>          memcg = pc->mem_cgroup
>          new_memcg->nr_dirty++
>
>
> But on second thoughts, it does have a problem without the lock:
>
>               CPU-A                                   CPU-B
>          if (PageDirty(page)) {
>                                                 move_lock_mem_cgroup()
>                                                 TestClearPageDirty(page)
>                                                 memcg = pc->mem_cgroup
>                                                 new_memcg->nr_dirty--
>                                                 move_unlock_mem_cgroup()
>
>          memcg = pc->mem_cgroup
>          new_memcg->nr_dirty--
>          }
>
>
> A race may occur with the clear_page_dirty() operation.
> So this time I think we need the lock again...
>
> Kame, what about your opinion...
>
I think the Dirty bit is cleared implicitly here... so having the lock will be good.
  
Thanks,
-Kame




^ permalink raw reply	[flat|nested] 132+ messages in thread

Thread overview: 132+ messages
2012-06-28 10:54 [PATCH 0/7] Per-cgroup page stat accounting Sha Zhengju
2012-06-28 10:54 ` Sha Zhengju
2012-06-28 10:54 ` Sha Zhengju
2012-06-28 10:57 ` [PATCH 1/7] memcg: update cgroup memory document Sha Zhengju
2012-06-28 10:57   ` Sha Zhengju
2012-06-28 10:57   ` Sha Zhengju
2012-07-02  7:00   ` Kamezawa Hiroyuki
2012-07-02  7:00     ` Kamezawa Hiroyuki
2012-07-04 12:47   ` Michal Hocko
2012-07-04 12:47     ` Michal Hocko
2012-07-04 12:47     ` Michal Hocko
2012-07-07 13:45   ` Fengguang Wu
2012-07-07 13:45     ` Fengguang Wu
2012-07-07 13:45     ` Fengguang Wu
2012-06-28 10:58 ` [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2012-06-28 10:58   ` Sha Zhengju
2012-07-02 10:44   ` Kamezawa Hiroyuki
2012-07-02 10:44     ` Kamezawa Hiroyuki
2012-07-02 10:44     ` Kamezawa Hiroyuki
2012-07-04 12:56   ` Michal Hocko
2012-07-04 12:56     ` Michal Hocko
2012-07-04 12:58     ` Michal Hocko
2012-07-04 12:58       ` Michal Hocko
2012-07-04 12:58       ` Michal Hocko
2012-07-07 13:48   ` Fengguang Wu
2012-07-07 13:48     ` Fengguang Wu
2012-07-09 21:01   ` Greg Thelen
2012-07-09 21:01     ` Greg Thelen
2012-07-09 21:01     ` Greg Thelen
2012-07-11  8:00     ` Sha Zhengju
2012-07-11  8:00       ` Sha Zhengju
2012-06-28 11:01 ` [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-06-28 11:01   ` Sha Zhengju
2012-07-02 11:14   ` Kamezawa Hiroyuki
2012-07-02 11:14     ` Kamezawa Hiroyuki
2012-07-02 11:14     ` Kamezawa Hiroyuki
2012-07-07 14:42     ` Fengguang Wu
2012-07-07 14:42       ` Fengguang Wu
2012-07-04 14:23   ` Michal Hocko
2012-07-04 14:23     ` Michal Hocko
2012-06-28 11:03 ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
2012-06-28 11:03   ` Sha Zhengju
2012-06-28 11:03   ` Sha Zhengju
2012-06-29  5:21   ` Sage Weil
2012-06-29  5:21     ` Sage Weil
2012-06-29  5:21     ` Sage Weil
2012-07-02  8:10     ` Sha Zhengju
2012-07-02  8:10       ` Sha Zhengju
2012-07-02 14:49       ` Sage Weil
2012-07-02 14:49         ` Sage Weil
2012-07-04  8:11         ` Sha Zhengju
2012-07-04  8:11           ` Sha Zhengju
2012-07-05 15:20           ` Sage Weil
2012-07-05 15:20             ` Sage Weil
2012-07-05 15:40             ` Sha Zhengju
2012-07-05 15:40               ` Sha Zhengju
2012-07-04 14:27   ` Michal Hocko
2012-07-04 14:27     ` Michal Hocko
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
2012-06-28 11:04   ` Sha Zhengju
2012-06-28 11:04   ` Sha Zhengju
2012-07-03  5:57   ` Kamezawa Hiroyuki
2012-07-03  5:57     ` Kamezawa Hiroyuki
2012-07-08 14:45     ` Fengguang Wu
2012-07-08 14:45       ` Fengguang Wu
2012-07-04 16:11   ` Michal Hocko
2012-07-04 16:11     ` Michal Hocko
2012-07-04 16:11     ` Michal Hocko
2012-07-09 21:02   ` Greg Thelen
2012-07-09 21:02     ` Greg Thelen
2012-07-11  9:32     ` Sha Zhengju
2012-07-11  9:32       ` Sha Zhengju
2012-07-19  6:33       ` Kamezawa Hiroyuki
2012-07-19  6:33         ` Kamezawa Hiroyuki
2012-07-19  6:33         ` Kamezawa Hiroyuki
2012-06-28 11:05 ` [PATCH 6/7] memcg: add per cgroup writeback " Sha Zhengju
2012-06-28 11:05   ` Sha Zhengju
2012-07-03  6:31   ` Kamezawa Hiroyuki
2012-07-03  6:31     ` Kamezawa Hiroyuki
2012-07-04  8:24     ` Sha Zhengju
2012-07-04  8:24       ` Sha Zhengju
2012-07-08 14:44     ` Fengguang Wu
2012-07-08 14:44       ` Fengguang Wu
2012-07-08 23:01       ` Johannes Weiner
2012-07-08 23:01         ` Johannes Weiner
2012-07-09  1:37         ` Fengguang Wu
2012-07-09  1:37           ` Fengguang Wu
2012-07-09  1:37           ` Fengguang Wu
2012-07-04 16:15   ` Michal Hocko
2012-07-04 16:15     ` Michal Hocko
2012-06-28 11:06 ` Sha Zhengju
2012-06-28 11:06   ` Sha Zhengju
2012-06-28 11:06   ` Sha Zhengju
2012-07-08 14:53   ` Fengguang Wu
2012-07-08 14:53     ` Fengguang Wu
2012-07-08 14:53     ` Fengguang Wu
2012-07-09  3:36     ` Sha Zhengju
2012-07-09  3:36       ` Sha Zhengju
2012-07-09  3:36       ` Sha Zhengju
2012-07-09  4:14       ` Fengguang Wu
2012-07-09  4:14         ` Fengguang Wu
2012-07-09  4:14         ` Fengguang Wu
2012-07-09  4:18         ` Kamezawa Hiroyuki
2012-07-09  4:18           ` Kamezawa Hiroyuki
2012-07-09  5:22           ` Sha Zhengju
2012-07-09  5:22             ` Sha Zhengju
2012-07-09  5:22             ` Sha Zhengju
2012-07-09  5:28             ` Fengguang Wu
2012-07-09  5:28               ` Fengguang Wu
2012-07-09  5:28               ` Fengguang Wu
2012-07-09  5:19         ` Sha Zhengju
2012-07-09  5:19           ` Sha Zhengju
2012-07-09  5:25           ` Fengguang Wu
2012-07-09  5:25             ` Fengguang Wu
2012-07-09 21:02   ` Greg Thelen
2012-07-09 21:02     ` Greg Thelen
2012-06-28 11:06 ` [PATCH 7/7] memcg: print more detailed info while memcg oom happening Sha Zhengju
2012-06-28 11:06   ` Sha Zhengju
2012-06-28 11:06   ` Sha Zhengju
2012-07-04  8:25   ` Sha Zhengju
2012-07-04  8:25     ` Sha Zhengju
2012-07-04  8:25     ` Sha Zhengju
2012-07-04  8:29   ` Kamezawa Hiroyuki
2012-07-04  8:29     ` Kamezawa Hiroyuki
2012-07-04 11:20     ` Sha Zhengju
2012-07-04 11:20       ` Sha Zhengju
2012-07-04 11:20       ` Sha Zhengju
2012-06-29  8:23 ` [PATCH 0/7] Per-cgroup page stat accounting Kamezawa Hiroyuki
2012-06-29  8:23   ` Kamezawa Hiroyuki
2012-07-02  7:51   ` Sha Zhengju
2012-07-02  7:51     ` Sha Zhengju
2012-07-02  7:51     ` Sha Zhengju
