* [PATCH V5 0/8] Add memcg dirty/writeback page accounting
@ 2013-08-01 11:43 ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:43 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

Hi,

This is the V5 patch series, which provides each memory cgroup with independent
dirty/writeback page statistics.
Previous versions are here:
  V4 - http://www.spinics.net/lists/cgroups/msg08200.html;
  V3 - http://lwn.net/Articles/530776/;
  V2 - http://lwn.net/Articles/508452/;
  V1 - http://lwn.net/Articles/504017/;

The first three patches do some cleanup and preparation work. The first two are
unchanged since V3, and patch 3/8 is a new one that checks for the proper locks held.
Patches 4/8 and 5/8 do the memcg dirty and writeback page accounting, which adds
statistics code to several hot paths.

Patches 6/8 and 7/8 try to remove the overhead introduced by the previous two
patches; this is the main change since V3. Patch 6 is a preparatory one that makes
nocpu_base available for all usages, not only hotplug cases. I took it
from Glauber Costa - http://www.spinics.net/lists/cgroups/msg06233.html. Patch 7
does some optimization with a jump label: if only the root memcg exists, we don't
need to do page stat accounting at all, and we transfer the global page stats to
root only when the first non-root memcg is created.
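
For illustration, the jump-label pattern patch 7 relies on looks roughly like
this (a sketch of the 3.10-era static_key API; see patch 7 for the real code):

	struct static_key memcg_inuse_key;	/* false by default: branch patched out */

	static inline bool mem_cgroup_in_use(void)
	{
		return static_key_false(&memcg_inuse_key);	/* near-free until armed */
	}

	/* armed when the first non-root memcg is created ... */
	static_key_slow_inc(&memcg_inuse_key);
	/* ... and disarmed when a non-root memcg is destroyed */
	static_key_slow_dec(&memcg_inuse_key);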

Some performance numbers obtained with Mel's pft test (on a machine with 4GB
memory and a 4-core i5 CPU):

vanilla  : memcg enabled, patches not applied
patched  : all patches applied

* Duration numbers:
             vanilla     patched
User          385.38      379.47
System         65.12       66.46
Elapsed       457.46      452.21

* Summary numbers:
vanilla:
Clients User        System      Elapsed     Faults/cpu  Faults/sec  
1       0.03        0.18        0.21        931682.645  910993.850  
2       0.03        0.22        0.13        760431.152  1472985.863 
3       0.03        0.29        0.12        600495.043  1620311.084 
4       0.04        0.37        0.12        475380.531  1688013.267

patched:
Clients User        System      Elapsed     Faults/cpu  Faults/sec  
1       0.02        0.19        0.22        915362.875  898763.732  
2       0.03        0.23        0.13        757518.387  1464893.996 
3       0.03        0.30        0.12        592113.126  1611873.469 
4       0.04        0.38        0.12        472203.393  1680013.271

We can see the performance gap is minor.

Change log:
v5 <-- v4:
	1. add patch 3 to check for the proper lock held, as suggested by Michal Hocko
	2. add two more interfaces that should call
	   mem_cgroup_begin/end_update_page_stat() in dirty page accounting
	3. make nocpu_base usable beyond hotplug cases
	4. don't account root memcg page stats if only root exists
v4 <-- v3:
	1. give up reworking the vfs code
	2. change lock order of memcg->move_lock and mapping->tree_lock
	3. patch out mem_cgroup_{begin,end}_update_page_stat when not used
	4. rebased onto the since-3.10 branch
v3 <-- v2:
	1. change lock order of mapping->tree_lock and memcg->move_lock
	2. performance optimization in 6/8 and 7/8
v2 <-- v1:
	1. add test numbers
	2. small fixes and comment updates

Glauber Costa (1):
      memcg: make nocpu_base available for non-hotplug

Sha Zhengju (7):
      memcg: remove MEMCG_NR_FILE_MAPPED
      fs/ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
      memcg: check for proper lock held in mem_cgroup_update_page_stat
      memcg: add per cgroup dirty pages accounting
      memcg: add per cgroup writeback pages accounting
      memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
      memcg: Document cgroup dirty/writeback memory statistics

 Documentation/cgroups/memory.txt |    2 +
 fs/buffer.c                      |   13 +++
 fs/ceph/addr.c                   |   13 +--
 include/linux/memcontrol.h       |   47 ++++++++--
 mm/filemap.c                     |   17 +++-
 mm/memcontrol.c                  |  189 +++++++++++++++++++++++++++++---------
 mm/page-writeback.c              |   39 +++++++-
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |   12 +++
 mm/vmscan.c                      |    7 ++
 10 files changed, 273 insertions(+), 70 deletions(-)


* [PATCH 1/8] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-08-01 11:43 ` Sha Zhengju
@ 2013-08-01 11:44 ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:44 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When accounting memcg page stats, it's not worthwhile to keep MEMCG_NR_FILE_MAPPED
as an extra layer of indirection, given the complexity and presumed
performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
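
Callers then pass the stat index directly, e.g. in the mm/rmap.c hunk below:

	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);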

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |   27 +++++++++++++++++++--------
 mm/memcontrol.c            |   25 +------------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..d166aeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,9 +30,20 @@ struct page;
 struct mm_struct;
 struct kmem_cache;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c.
+ * These two lists should be kept in sync with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
+	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -165,17 +176,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -349,12 +360,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d12ca6f..7691cef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,21 +85,6 @@ static int really_do_swap_account __initdata = 0;
 #endif
 
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
-	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -2307,7 +2292,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -2320,14 +2305,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 7066470..66c2260 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
 				hpage_nr_pages(page));
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5


* [PATCH V5 2/8] fs/ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 11:43 ` Sha Zhengju
@ 2013-08-01 11:51 ` Sha Zhengju
  2013-08-01 15:19   ` Yan, Zheng
  -1 siblings, 1 reply; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:51 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel, linux-mm, cgroups
  Cc: sage, mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu,
	akpm, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

The following patches will add memcg dirty page accounting around __set_page_dirty_
{buffers,nobuffers} in the vfs layer, so we'd better use the vfs interface here to
avoid exporting those details to filesystems.
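
For reference, __set_page_dirty_nobuffers() already covers everything that
ceph_set_page_dirty() open-coded; roughly (sketched from the lines this patch
removes, ignoring the race-with-truncate details):

	if (TestSetPageDirty(page))
		return 0;			/* was already dirty */
	spin_lock_irq(&mapping->tree_lock);
	account_page_dirtied(page, mapping);
	radix_tree_tag_set(&mapping->page_tree, page_index(page),
			   PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	return 1;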

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/ceph/addr.c |   13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..1445bf1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page)) {
+	if (!__set_page_dirty_nobuffers(page)) {
 		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 		     mapping->host, page, page->index);
 		return 0;
@@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
-- 
1.7.9.5


* [PATCH V5 3/8] memcg: check for proper lock held in mem_cgroup_update_page_stat
  2013-08-01 11:43 ` Sha Zhengju
                   ` (2 preceding siblings ...)
@ 2013-08-01 11:52 ` Sha Zhengju
  2013-08-01 14:34     ` Michal Hocko
  2013-08-04 18:48     ` Greg Thelen
  -1 siblings, 2 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:52 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to take the proper locks, but the
latter doesn't do any checking that we actually hold them, which
would be hard to do. As suggested by Michal Hocko, we can at least test for
rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().
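
The check matches the expected calling pattern, since the begin/end helpers take
rcu_read_lock() whenever memcg is enabled, e.g.:

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);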

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 mm/memcontrol.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7691cef..4a55d46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2301,6 +2301,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (mem_cgroup_disabled())
 		return;
 
+	VM_BUG_ON(!rcu_read_lock_held());
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
-- 
1.7.9.5


* [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting
  2013-08-01 11:43 ` Sha Zhengju
                   ` (3 preceding siblings ...)
@ 2013-08-01 11:53 ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory.

After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we
can use the 'struct page' flags to test page state instead of per-page_cgroup
flags. But memcg can move a page from one cgroup to another, so "move" may race
with "page stat accounting". In order to avoid the race we have designed a new
locking protocol:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (the dirty page accounting) to be protected by
mem_cgroup_{begin/end}_update_page_stat(). This is a full no-op for !CONFIG_MEMCG,
almost a no-op if memcg is disabled (but compiled in), an rcu read lock in most
cases (no task is moving), and spin_lock_irqsave on top of that in the slow path.
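
Roughly, the begin side looks like this (a sketch of the existing inline in
include/linux/memcontrol.h, not a new interface):

	static inline void mem_cgroup_begin_update_page_stat(struct page *page,
					bool *locked, unsigned long *flags)
	{
		if (mem_cgroup_disabled())
			return;
		rcu_read_lock();		/* fast path: nobody is moving pages */
		*locked = false;
		if (atomic_read(&memcg_moving))	/* slow path: take memcg->move_lock */
			__mem_cgroup_begin_update_page_stat(page, locked, flags);
	}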

Accounting needs to be added in several places:
        incrementing (3):
                __set_page_dirty_buffers
                __set_page_dirty_nobuffers
		mark_buffer_dirty
        decrementing (7):
                clear_page_dirty_for_io
                cancel_dirty_page
		delete_from_page_cache
		__delete_from_page_cache
		replace_page_cache_page
		invalidate_complete_page2
		__remove_mapping

The lock order between memcg lock and mapping lock is:
	--> memcg->move_lock
	  --> mapping->private_lock
            --> mapping->tree_lock
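
For example, __set_page_dirty_buffers() below takes them in this order:

	mem_cgroup_begin_update_page_stat(page, &locked, &flags); /* memcg->move_lock */
	spin_lock(&mapping->private_lock);
	/* ... set buffer and page dirty bits ... */
	spin_unlock(&mapping->private_lock);
	if (newly_dirty)
		__set_page_dirty(page, mapping, 1);	/* takes mapping->tree_lock */
	mem_cgroup_end_update_page_stat(page, &locked, &flags);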

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c                |   13 +++++++++++++
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   17 ++++++++++++++++-
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   24 ++++++++++++++++++++++--
 mm/truncate.c              |   12 ++++++++++++
 mm/vmscan.c                |    7 +++++++
 7 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 695eb14..2d92d60 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -694,10 +694,17 @@ int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
+	/* The main concern is page stat accounting; the lock is used to protect
+	 * both SetPageDirty and the accounting, so the !mapping check above can
+	 * stay outside of it.
+	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -713,6 +720,7 @@ int __set_page_dirty_buffers(struct page *page)
 
 	if (newly_dirty)
 		__set_page_dirty(page, mapping, 1);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1169,11 +1177,16 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		bool locked;
+		unsigned long flags;
+
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (!TestSetPageDirty(page)) {
 			struct address_space *mapping = page_mapping(page);
 			if (mapping)
 				__set_page_dirty(page, mapping, 0);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d166aeb..f952be6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index a6981fe..690e4d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -65,6 +65,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *  ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *						move_lock_mem_cgroup)
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *        ->mapping->tree_lock
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -110,7 +115,8 @@
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold the mapping's tree_lock.
+ * is safe.  The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin/end_update_page_stat() lock.
  */
 void __delete_from_page_cache(struct page *page)
 {
@@ -144,6 +150,7 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
@@ -161,13 +168,17 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	bool locked;
+	unsigned long flags;
 
 	BUG_ON(!PageLocked(page));
 
 	freepage = mapping->a_ops->freepage;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
@@ -417,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		bool locked;
+		unsigned long flags;
 
 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
@@ -425,6 +438,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
+		mem_cgroup_begin_update_page_stat(old, &locked, &flags);
 		spin_lock_irq(&mapping->tree_lock);
 		__delete_from_page_cache(old);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
@@ -434,6 +448,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(old, &locked, &flags);
 		/* mem_cgroup codes must not be called under tree_lock */
 		mem_cgroup_replace_page_cache(old, new);
 		radix_tree_preload_end();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a55d46..8f3e514 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,6 +90,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"dirty",
 	"swap",
 };
 
@@ -3744,6 +3745,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3789,13 +3804,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (!anon && PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..a09f518 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while calling this function.
+ * See __set_page_dirty_{nobuffers,buffers} for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
@@ -1989,6 +1994,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 	trace_writeback_dirty_page(page, mapping);
 
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2028,6 +2034,11 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
@@ -2045,12 +2056,15 @@ int __set_page_dirty_nobuffers(struct page *page)
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2166,6 +2180,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2207,13 +2224,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..ed41636 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
@@ -386,12 +392,16 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
 static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (page->mapping != mapping)
 		return 0;
 
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
@@ -399,6 +409,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
@@ -408,6 +419,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	return 1;
 failed:
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e364542..ba1ad4a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -486,9 +486,13 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -527,6 +531,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
@@ -535,6 +540,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		__delete_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
@@ -545,6 +551,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 cannot_free:
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 
-- 
1.7.9.5


* [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-01 11:43 ` Sha Zhengju
                   ` (4 preceding siblings ...)
@ 2013-08-01 11:54 ` Sha Zhengju
  2013-08-01 14:53   ` Michal Hocko
  2013-08-04 18:51     ` Greg Thelen
  -1 siblings, 2 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:54 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty page accounting, we add per-cgroup writeback page accounting.
The locking rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_{clear,set}_page_writeback().
Lock order:
	--> memcg->move_lock
	  --> mapping->tree_lock
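
So test_clear_page_writeback(), abridged from the diff below, becomes:

	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
	/* ... ret = TestClearPageWriteback(page), under mapping->tree_lock
	 * if the page has a mapping ... */
	if (ret) {
		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
		dec_zone_page_state(page, NR_WRITEBACK);
		inc_zone_page_state(page, NR_WRITTEN);
	}
	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);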

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   15 +++++++++++++++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f952be6..ccd35d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
 	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f3e514..6c18a6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss_huge",
 	"mapped_file",
 	"dirty",
+	"writeback",
 	"swap",
 };
 
@@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 			MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a09f518..2fa6a52 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while calling this function.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.9.5


* [PATCH V5 6/8] memcg: make nocpu_base available for non-hotplug
  2013-08-01 11:43 ` Sha Zhengju
                   ` (5 preceding siblings ...)
@ 2013-08-01 11:55 ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:55 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju, Glauber Costa

From: Glauber Costa <glommer@parallels.com>

This is inspired by Glauber Costa; his first version is here:
http://www.spinics.net/lists/cgroups/msg06233.html.

We are using nocpu_base to accumulate numbers from the main counters
during cpu hotplug. In a later patch we need to transfer page stats to
the root cgroup when lazily enabling memcg. Since the system-wide counters
are not kept per-cpu, they are hard to distribute across cpus. So make this
field available for all usages, not only hotplug cases.

Sha Zhengju: rename nocpu_base to stats_base
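
The read side then always folds this base in; abridged from
mem_cgroup_read_stat() in the diff below:

	get_online_cpus();
	for_each_online_cpu(cpu)
		val += per_cpu(memcg->stat->count[idx], cpu);
	spin_lock(&memcg->pcp_counter_lock);
	val += memcg->stats_base.count[idx];	/* hotplug residue or transferred stats */
	spin_unlock(&memcg->pcp_counter_lock);
	put_online_cpus();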

Signed-off-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c18a6d..54da686 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -303,10 +303,10 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu __percpu *stat;
 	/*
-	 * used when a cpu is offlined or other synchronizations
+	 * used when the first non-root memcg is created or a cpu is offlined.
 	 * See mem_cgroup_read_stat().
 	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
+	struct mem_cgroup_stat_cpu stats_base;
 	spinlock_t pcp_counter_lock;
 
 	atomic_t	dead_count;
@@ -845,11 +845,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	get_online_cpus();
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.count[idx];
+	val += memcg->stats_base.count[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	put_online_cpus();
 	return val;
 }
@@ -869,11 +869,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.events[idx];
+	val += memcg->stats_base.events[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	return val;
 }
 
@@ -2491,13 +2491,13 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
 		long x = per_cpu(memcg->stat->count[i], cpu);
 
 		per_cpu(memcg->stat->count[i], cpu) = 0;
-		memcg->nocpu_base.count[i] += x;
+		memcg->stats_base.count[i] += x;
 	}
 	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
 		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
 
 		per_cpu(memcg->stat->events[i], cpu) = 0;
-		memcg->nocpu_base.events[i] += x;
+		memcg->stats_base.events[i] += x;
 	}
 	spin_unlock(&memcg->pcp_counter_lock);
 }
-- 
1.7.9.5


* [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists
  2013-08-01 11:43 ` Sha Zhengju
                   ` (6 preceding siblings ...)
@ 2013-08-01 12:00 ` Sha Zhengju
  2013-08-01 16:20   ` Johannes Weiner
  -1 siblings, 1 reply; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 12:00 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

If memcg is enabled but no non-root memcg exists, all allocated pages
belong to root and go through the root memcg statistics routines. So in order
to reduce the overhead after adding memcg dirty/writeback accounting to hot paths,
we use a jump label to patch the accounting-related functions in or out when not
used. If no non-root memcg has come to life, we do not need to acquire the moving
locks or update page stats.

But to keep the root memcg stats correct in the long run, we transfer the global
numbers to it when the first non-root memcg is created, and from then on the root
does accounting as usual. In addition to this, we also need to take care of
memcg_stat_show() when only the root memcg exists.
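
While only the root memcg exists, the whole page stat protocol therefore
degrades to roughly:

	mem_cgroup_begin_update_page_stat()  ->  rcu_read_lock() only
	mem_cgroup_update_page_stat()        ->  returns after the jump-label check
	mem_cgroup_end_update_page_stat()    ->  rcu_read_unlock() only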

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |   18 ++++++++
 mm/memcontrol.c            |  108 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 122 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ccd35d8..c66163b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -55,6 +55,13 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+extern struct static_key memcg_inuse_key;
+static inline bool mem_cgroup_in_use(void)
+{
+	return static_key_false(&memcg_inuse_key);
+}
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -159,7 +166,10 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 {
 	if (mem_cgroup_disabled())
 		return;
+
 	rcu_read_lock();
+	if (!mem_cgroup_in_use())
+		return;
 	*locked = false;
 	if (atomic_read(&memcg_moving))
 		__mem_cgroup_begin_update_page_stat(page, locked, flags);
@@ -172,6 +182,10 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 {
 	if (mem_cgroup_disabled())
 		return;
+	if (!mem_cgroup_in_use()) {
+		rcu_read_unlock();
+		return;
+	}
 	if (*locked)
 		__mem_cgroup_end_update_page_stat(page, flags);
 	rcu_read_unlock();
@@ -215,6 +229,10 @@ void mem_cgroup_print_bad_page(struct page *page);
 #endif
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
+static inline bool mem_cgroup_in_use(void)
+{
+	return false;
+}
 
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54da686..8928bd4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,13 @@ enum res_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
+/* static_key is used for marking whether memcg is in use. We use this jump label
+ * to patch some memcg page stat accounting code in or out.
+ * The key is incremented when a non-root memcg is created, and decremented
+ * when such a memcg is destroyed.
+ */
+struct static_key memcg_inuse_key;
+
 /*
  * The memcg_create_mutex will be held whenever a new cgroup is created.
  * As a consequence, any change that needs to protect against new child cgroups
@@ -630,10 +637,22 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void disarm_inuse_keys(struct mem_cgroup *memcg)
+{
+	if (!mem_cgroup_is_root(memcg))
+		static_key_slow_dec(&memcg_inuse_key);
+}
+
+static void arm_inuse_keys(void)
+{
+	static_key_slow_inc(&memcg_inuse_key);
+}
+
 static void disarm_static_keys(struct mem_cgroup *memcg)
 {
 	disarm_sock_keys(memcg);
 	disarm_kmem_keys(memcg);
+	disarm_inuse_keys(memcg);
 }
 
 static void drain_all_stock_async(struct mem_cgroup *memcg);
@@ -851,6 +870,9 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	spin_unlock(&memcg->pcp_counter_lock);
 
 	put_online_cpus();
+
+	if (val < 0)
+		val = 0;
 	return val;
 }
 
@@ -2298,12 +2320,14 @@ void mem_cgroup_update_page_stat(struct page *page,
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
 
 	VM_BUG_ON(!rcu_read_lock_held());
+	if (!mem_cgroup_in_use())
+		return;
+
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
@@ -5431,12 +5455,37 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	struct mem_cgroup *mi;
 	unsigned int i;
-
+	struct stats {
+		unsigned long file_mapped;
+		unsigned long dirty;
+		unsigned long writeback;
+	} root_stats;
+	bool use_global = false;
+
+	/* If only the root memcg exists, we should borrow some page stats
+	 * from global state.
+	 */
+	if (!mem_cgroup_in_use() && mem_cgroup_is_root(memcg)) {
+		use_global = true;
+		root_stats.file_mapped = global_page_state(NR_FILE_MAPPED);
+		root_stats.dirty = global_page_state(NR_FILE_DIRTY);
+		root_stats.writeback = global_page_state(NR_WRITEBACK);
+	}
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
 		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
 			continue;
-		seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
-			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
+		if (use_global && i == MEM_CGROUP_STAT_FILE_MAPPED)
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+					root_stats.file_mapped * PAGE_SIZE);
+		else if (use_global && i == MEM_CGROUP_STAT_FILE_DIRTY)
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+					root_stats.dirty * PAGE_SIZE);
+		else if (use_global && i == MEM_CGROUP_STAT_WRITEBACK)
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+					root_stats.writeback * PAGE_SIZE);
+		else
+			seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+				mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
 	}
 
 	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
@@ -5464,6 +5513,14 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 			continue;
 		for_each_mem_cgroup_tree(mi, memcg)
 			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
+
+		if (use_global && i == MEM_CGROUP_STAT_FILE_MAPPED)
+			val += root_stats.file_mapped * PAGE_SIZE;
+		else if (use_global && i == MEM_CGROUP_STAT_FILE_DIRTY)
+			val += root_stats.dirty * PAGE_SIZE;
+		else if (use_global && i == MEM_CGROUP_STAT_WRITEBACK)
+			val += root_stats.writeback * PAGE_SIZE;
+
 		seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
 	}
 
@@ -6303,6 +6360,49 @@ mem_cgroup_css_online(struct cgroup *cont)
 	}
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+	if (!error) {
+		if (!mem_cgroup_in_use()) {
+			/* I'm the first non-root memcg, move global stats to root memcg.
+			 * Memcg creating is serialized by cgroup locks(cgroup_mutex),
+			 * Memcg creation is serialized by cgroup locks (cgroup_mutex),
+			 * so the mem_cgroup_in_use() check is safe.
+			 * We use global_page_state() to get global page stats, but
+			 * because of the optimized inc/dec functions in SMP while
+			 * updating each zone's stats, We may lose some numbers
+			 * updating each zone's stats, we may lose some counts
+			 * in a stock (zone->pageset->vm_stat_diff), which brings some
+			 * inaccuracy. But the places where the kernel uses these page
+			 * stats to steer the next decision, e.g. dirty page throttling
+			 * or writeback, also use global_page_state(), so it's enough here too.
+			spin_lock(&root_mem_cgroup->pcp_counter_lock);
+			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
+						global_page_state(NR_FILE_MAPPED);
+			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_DIRTY] =
+						global_page_state(NR_FILE_DIRTY);
+			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_WRITEBACK] =
+						global_page_state(NR_WRITEBACK);
+			spin_unlock(&root_mem_cgroup->pcp_counter_lock);
+		}
+
+		/*
+		 * memcg_inuse_key is used to check whether any non-root memcg
+		 * has been created. To avoid races among page stat updates,
+		 * non-root memcg creation and move accounting, we should do
+		 * page stat updates under rcu_read_lock():
+		 *
+		 *	CPU-A			    CPU-B
+		 * rcu_read_lock()
+		 *
+		 * if (memcg_inuse_key)          arm_inuse_keys()
+		 *    update memcg page stat     synchronize_rcu()
+		 *
+		 * rcu_read_unlock()
+		 *				start move here and move accounting
+		 */
+		arm_inuse_keys();
+		synchronize_rcu();
+	}
+
 	mutex_unlock(&memcg_create_mutex);
 	return error;
 }
-- 
1.7.9.5


* [PATCH V5 8/8] memcg: Document cgroup dirty/writeback memory statistics
@ 2013-08-01 12:00   ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-01 12:00 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 327acec..7ed4fa9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -490,6 +490,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty          - # of bytes of file cache that are not in sync with the disk copy.
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


* Re: [PATCH V5 3/8] memcg: check for proper lock held in mem_cgroup_update_page_stat
@ 2013-08-01 14:34     ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2013-08-01 14:34 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, glommer, gthelen,
	fengguang.wu, akpm, Sha Zhengju

On Thu 01-08-13 19:52:26, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> We should call mem_cgroup_begin_update_page_stat() before
> mem_cgroup_update_page_stat() to take the proper locks, but the
> latter doesn't do any checking that we actually hold them, which
> would be hard to do. As suggested by Michal Hocko, we can at least test for
> rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7691cef..4a55d46 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2301,6 +2301,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	if (mem_cgroup_disabled())
>  		return;
>  
> +	VM_BUG_ON(!rcu_read_lock_held());
>  	memcg = pc->mem_cgroup;
>  	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>  		return;
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V5 0/8] Add memcg dirty/writeback page accounting
  2013-08-01 11:43 ` Sha Zhengju
                   ` (8 preceding siblings ...)
  (?)
@ 2013-08-01 14:43 ` Michal Hocko
  2013-08-03  9:30   ` Sha Zhengju
  -1 siblings, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2013-08-01 14:43 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, glommer, gthelen,
	fengguang.wu, akpm, Sha Zhengju

On Thu 01-08-13 19:43:22, Sha Zhengju wrote:
[...]
> Some performance numbers obtained with Mel's pft test (on a 4G-memory,
> 4-core i5 CPU machine):

I am a little bit confused about what this testcase is actually
testing... AFAIU it produces a lot of page faults, but they are all
anonymous and very short-lived, so neither dirty nor writeback
accounting is done.

I would have expected a testcase which generates a lot of IO.

Also, as a general note, it would be better to mention the number of
runs and the standard deviation so that we have an idea of the
variability of the load.

> vanilla  : memcg enabled, patch not applied
> patched  : all patches are patched
> 
> * Duration numbers:
>              vanilla     patched
> User          385.38      379.47
> System         65.12       66.46
> Elapsed       457.46      452.21
> 
> * Summary numbers:
> vanilla:
> Clients User        System      Elapsed     Faults/cpu  Faults/sec  
> 1       0.03        0.18        0.21        931682.645  910993.850  
> 2       0.03        0.22        0.13        760431.152  1472985.863 
> 3       0.03        0.29        0.12        600495.043  1620311.084 
> 4       0.04        0.37        0.12        475380.531  1688013.267
> 
> patched:
> Clients User        System      Elapsed     Faults/cpu  Faults/sec  
> 1       0.02        0.19        0.22        915362.875  898763.732  
> 2       0.03        0.23        0.13        757518.387  1464893.996 
> 3       0.03        0.30        0.12        592113.126  1611873.469 
> 4       0.04        0.38        0.12        472203.393  1680013.271
> 
> We can see the performance gap is minor.
[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
@ 2013-08-01 14:53   ` Michal Hocko
       [not found]     ` <20130801145302.GJ5198-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2013-08-04 18:51     ` Greg Thelen
  1 sibling, 1 reply; 47+ messages in thread
From: Michal Hocko @ 2013-08-01 14:53 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, kamezawa.hiroyu, glommer,
	gthelen, fengguang.wu, akpm, Sha Zhengju

On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page accounting, we add per-cgroup writeback page
> accounting. The locking rule is still:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
> 
> There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
> Lock order:
> 	--> memcg->move_lock
> 	  --> mapping->tree_lock
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Looks good to me. Maybe I would suggest moving this patch up the stack
so that it might get merged earlier as it is simpler than dirty pages
accounting. Unless you insist on having the full series merged at once.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   15 +++++++++++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f952be6..ccd35d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8f3e514..6c18a6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss_huge",
>  	"mapped_file",
>  	"dirty",
> +	"writeback",
>  	"swap",
>  };
>  
> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a09f518..2fa6a52 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while calling this function.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs
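
Condensed from the test_set_page_writeback() hunk above, the lock
nesting this patch establishes looks roughly like this (radix tree
tagging and bdi handling elided):

	bool locked;
	unsigned long memcg_flags;

	/* outer: may take memcg->move_lock, under RCU */
	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
	if (mapping) {
		unsigned long flags;

		/* inner: mapping->tree_lock nests inside move_lock */
		spin_lock_irqsave(&mapping->tree_lock, flags);
		ret = TestSetPageWriteback(page);
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
	} else {
		ret = TestSetPageWriteback(page);
	}
	if (!ret)
		account_page_writeback(page);	/* memcg + zone counters */
	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);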


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
@ 2013-08-01 15:19   ` Yan, Zheng
  2013-08-01 18:27     ` Sage Weil
  2013-08-02  9:04       ` Sha Zhengju
  0 siblings, 2 replies; 47+ messages in thread
From: Yan, Zheng @ 2013-08-01 15:19 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, ceph-devel, linux-mm, cgroups, Sage Weil, mhocko,
	kamezawa.hiroyu, glommer, Greg Thelen, Wu Fengguang,
	Andrew Morton, Sha Zhengju

[-- Attachment #1: Type: text/plain, Size: 2975 bytes --]

On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> In the following patches we will begin to add memcg dirty page accounting
> around __set_page_dirty_{buffers,nobuffers} in the vfs layer, so we'd
> better use the vfs interface to avoid exporting those details to
> filesystems.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/ceph/addr.c |   13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 3e68ac1..1445bf1 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>         if (unlikely(!mapping))
>                 return !TestSetPageDirty(page);
>
> -       if (TestSetPageDirty(page)) {
> +       if (!__set_page_dirty_nobuffers(page)) {

it's too early to set the radix tree tag here. We should set the page's
snapshot context and increase i_wrbuffer_ref first, because once the
tag is set, the writeback thread can find the page and start flushing it.


>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>                      mapping->host, page, page->index);
>                 return 0;
> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
>              snapc, snapc->seq, snapc->num_snaps);
>         spin_unlock(&ci->i_ceph_lock);
>
> -       /* now adjust page */
> -       spin_lock_irq(&mapping->tree_lock);
>         if (page->mapping) {    /* Race with truncate? */
> -               WARN_ON_ONCE(!PageUptodate(page));
> -               account_page_dirtied(page, page->mapping);
> -               radix_tree_tag_set(&mapping->page_tree,
> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> -

this code was copied from __set_page_dirty_nobuffers(). I think the
reason Sage did this is to handle the race described in
__set_page_dirty_nobuffers()'s comment. But I wonder if "page->mapping ==
NULL" can still happen here, because truncate_inode_page() unmaps the
page from processes' address spaces first, then deletes the page from
the page cache.

Regards
Yan, Zheng
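
To make the hazard concrete, a sketch of the interleaving described
above (an illustrative timeline, not code from the patch):

	/*
	 * ceph_set_page_dirty()               writeback thread
	 * ---------------------               ----------------
	 * __set_page_dirty_nobuffers()
	 *   tags radix tree DIRTY      ---->  finds page via DIRTY tag,
	 *                                     starts flushing it ...
	 * set snapc in page->private          ... before the snapshot
	 * i_wrbuffer_ref++                    context is attached
	 */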

>                 /*
>                  * Reference snap context in page->private.  Also set
>                  * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
>                 undo = 1;
>         }
>
> -       spin_unlock_irq(&mapping->tree_lock);
> -
>         if (undo)
>                 /* whoops, we failed to dirty the page */
>                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>
> -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
>         BUG_ON(!PageDirty(page));
>         return 1;
>  }
> --
> 1.7.9.5
>

[-- Attachment #2: Type: text/html, Size: 4110 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists
  2013-08-01 12:00 ` [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists Sha Zhengju
@ 2013-08-01 16:20   ` Johannes Weiner
  2013-08-02  4:32       ` Sha Zhengju
  0 siblings, 1 reply; 47+ messages in thread
From: Johannes Weiner @ 2013-08-01 16:20 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, mhocko, kamezawa.hiroyu, glommer, gthelen,
	fengguang.wu, akpm, Sha Zhengju

On Thu, Aug 01, 2013 at 08:00:07PM +0800, Sha Zhengju wrote:
> @@ -6303,6 +6360,49 @@ mem_cgroup_css_online(struct cgroup *cont)
>  	}
>  
>  	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +	if (!error) {
> +		if (!mem_cgroup_in_use()) {
> +			/* I'm the first non-root memcg, move global stats to root memcg.
> +			 * Memcg creating is serialized by cgroup locks(cgroup_mutex),
> +			 * so the mem_cgroup_in_use() checking is safe.
> +			 *
> +			 * We use global_page_state() to get global page stats, but
> +			 * because of the optimized inc/dec functions in SMP while
> +			 * updating each zone's stats, We may lose some numbers
> +			 * in a stock(zone->pageset->vm_stat_diff) which brings some
> +			 * inaccuracy. But places where kernel use these page stats to
> +			 * steer next decision e.g. dirty page throttling or writeback
> +			 * also use global_page_state(), so here it's enough too.
> +			 */
> +			spin_lock(&root_mem_cgroup->pcp_counter_lock);
> +			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> +						global_page_state(NR_FILE_MAPPED);
> +			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_DIRTY] =
> +						global_page_state(NR_FILE_DIRTY);
> +			root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_WRITEBACK] =
> +						global_page_state(NR_WRITEBACK);
> +			spin_unlock(&root_mem_cgroup->pcp_counter_lock);
> +		}

If inaccuracies in these counters are okay, why do we go through an
elaborate locking scheme that sprinkles memcg callbacks everywhere
just to be 100% reliable in the rare case somebody moves memory
between cgroups?
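
For reference, a condensed sketch of how the mem_cgroup_in_use() fast
path above can be wired up with a jump label (the key name and exact
static-key calls are assumptions, not the patch's code):

	static struct static_key memcg_in_use_key = STATIC_KEY_INIT_FALSE;

	static inline bool mem_cgroup_in_use(void)
	{
		/* a patched no-op branch while only root memcg exists */
		return static_key_false(&memcg_in_use_key);
	}

	/* flipped once, when the first non-root memcg is created */
	static_key_slow_inc(&memcg_in_use_key);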


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 15:19   ` Yan, Zheng
@ 2013-08-01 18:27     ` Sage Weil
  2013-08-02 10:04       ` Sha Zhengju
  2013-08-02  9:04       ` Sha Zhengju
  1 sibling, 1 reply; 47+ messages in thread
From: Sage Weil @ 2013-08-01 18:27 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Sha Zhengju, linux-fsdevel, ceph-devel, linux-mm, cgroups,
	mhocko, kamezawa.hiroyu, glommer, Greg Thelen, Wu Fengguang,
	Andrew Morton, Sha Zhengju

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3697 bytes --]

On Thu, 1 Aug 2013, Yan, Zheng wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> >
> > Following we will begin to add memcg dirty page accounting around
> __set_page_dirty_
> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> avoid exporting
> > those details to filesystems.
> >
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> > ---
> >  fs/ceph/addr.c |   13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 3e68ac1..1445bf1 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >         if (unlikely(!mapping))
> >                 return !TestSetPageDirty(page);
> >
> > -       if (TestSetPageDirty(page)) {
> > +       if (!__set_page_dirty_nobuffers(page)) {
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

Unfortunately I only remember being frustrated by this code.  :)  Looking 
at it now, though, it seems like the minimum fix is to set the 
page->private before marking the page dirty.  I don't know the locking 
rules around that, though.  If that is potentially racy, maybe the safest 
thing would be if __set_page_dirty_nobuffers() took a void* to set 
page->private to atomically while holding the tree_lock.

sage

> 
> >                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >                      mapping->host, page, page->index);
> >                 return 0;
> > @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >              snapc, snapc->seq, snapc->num_snaps);
> >         spin_unlock(&ci->i_ceph_lock);
> >
> > -       /* now adjust page */
> > -       spin_lock_irq(&mapping->tree_lock);
> >         if (page->mapping) {    /* Race with truncate? */
> > -               WARN_ON_ONCE(!PageUptodate(page));
> > -               account_page_dirtied(page, page->mapping);
> > -               radix_tree_tag_set(&mapping->page_tree,
> > -                               page_index(page), PAGECACHE_TAG_DIRTY);
> > -
> 
> this code was coped from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.
> 
> Regards
> Yan, Zheng
> 
> >                 /*
> >                  * Reference snap context in page->private.  Also set
> >                  * PagePrivate so that we get invalidatepage callback.
> > @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
> >                 undo = 1;
> >         }
> >
> > -       spin_unlock_irq(&mapping->tree_lock);
> 
> 
> 
> 
> > -
> >         if (undo)
> >                 /* whoops, we failed to dirty the page */
> >                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> >
> > -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > -
> >         BUG_ON(!PageDirty(page));
> >         return 1;
> >  }
> > --
> > 1.7.9.5
> >
> 
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists
@ 2013-08-02  4:32       ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-02  4:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Cgroups, Michal Hocko, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Fri, Aug 2, 2013 at 12:20 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Aug 01, 2013 at 08:00:07PM +0800, Sha Zhengju wrote:
>> @@ -6303,6 +6360,49 @@ mem_cgroup_css_online(struct cgroup *cont)
>>       }
>>
>>       error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>> +     if (!error) {
>> +             if (!mem_cgroup_in_use()) {
>> +                     /* I'm the first non-root memcg, move global stats to root memcg.
>> +                      * Memcg creating is serialized by cgroup locks(cgroup_mutex),
>> +                      * so the mem_cgroup_in_use() checking is safe.
>> +                      *
>> +                      * We use global_page_state() to get global page stats, but
>> +                      * because of the optimized inc/dec functions in SMP while
>> +                      * updating each zone's stats, We may lose some numbers
>> +                      * in a stock(zone->pageset->vm_stat_diff) which brings some
>> +                      * inaccuracy. But places where kernel use these page stats to
>> +                      * steer next decision e.g. dirty page throttling or writeback
>> +                      * also use global_page_state(), so here it's enough too.
>> +                      */
>> +                     spin_lock(&root_mem_cgroup->pcp_counter_lock);
>> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
>> +                                             global_page_state(NR_FILE_MAPPED);
>> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_DIRTY] =
>> +                                             global_page_state(NR_FILE_DIRTY);
>> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_WRITEBACK] =
>> +                                             global_page_state(NR_WRITEBACK);
>> +                     spin_unlock(&root_mem_cgroup->pcp_counter_lock);
>> +             }
>
> If inaccuracies in these counters are okay, why do we go through an
> elaborate locking scheme that sprinkles memcg callbacks everywhere
> just to be 100% reliable in the rare case somebody moves memory
> between cgroups?

IMO they are not the same thing. Moving between cgroups may happen
many times; if we ignore the inaccuracy during each move, the
cumulative error caused by this back-and-forth behavior can become
very large.

But the transfer above occurs only once, when the first non-root
memcg is created, so the error is at most
zone->pageset->stat_threshold. This threshold depends on the amount of
zone memory and the number of processors, and its maximum value is
125, so I thought using global_page_state() is enough. Of course we
could also fold in the per-cpu stock for greater precision, but that
would still be subject to inaccuracy because of races.
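
(For scale: with 4 CPUs and the maximum per-cpu stat_threshold of 125
pages, the worst case is on the order of 4 * 125 = 500 pages per zone,
i.e. about 2MB with 4KB pages.)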


-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 15:19   ` Yan, Zheng
@ 2013-08-02  9:04       ` Sha Zhengju
  2013-08-02  9:04       ` Sha Zhengju
  1 sibling, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-02  9:04 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: linux-fsdevel, ceph-devel, linux-mm, Cgroups, Sage Weil,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Thu, Aug 1, 2013 at 11:19 PM, Yan, Zheng <ukernel@gmail.com> wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Following we will begin to add memcg dirty page accounting around
>> __set_page_dirty_
>> {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> avoid exporting
>> those details to filesystems.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>  fs/ceph/addr.c |   13 +------------
>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 3e68ac1..1445bf1 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>>         if (unlikely(!mapping))
>>                 return !TestSetPageDirty(page);
>>
>> -       if (TestSetPageDirty(page)) {
>> +       if (!__set_page_dirty_nobuffers(page)) {
>
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

OK, thanks for pointing it out.

>
>
>>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>>                      mapping->host, page, page->index);
>>                 return 0;
>> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
>>              snapc, snapc->seq, snapc->num_snaps);
>>         spin_unlock(&ci->i_ceph_lock);
>>
>> -       /* now adjust page */
>> -       spin_lock_irq(&mapping->tree_lock);
>>         if (page->mapping) {    /* Race with truncate? */
>> -               WARN_ON_ONCE(!PageUptodate(page));
>> -               account_page_dirtied(page, page->mapping);
>> -               radix_tree_tag_set(&mapping->page_tree,
>> -                               page_index(page), PAGECACHE_TAG_DIRTY);
>> -
>
> this code was coped from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.

But in the non-mmap case, doesn't it have nothing to do with 'unmap
page from address spaces'? The check exactly avoids racing with
delete_from_page_cache(), since both need to hold mapping->tree_lock,
and if truncate goes first then __set_page_dirty_nobuffers() may see a
NULL mapping.


Thanks,
Sha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 18:27     ` Sage Weil
@ 2013-08-02 10:04       ` Sha Zhengju
       [not found]         ` <CAFj3OHVXvtr5BDMrGatHZi7M9y+dh1ZKRMQZGjZmNBcg3pNQtw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 47+ messages in thread
From: Sha Zhengju @ 2013-08-02 10:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: Yan, Zheng, linux-fsdevel, ceph-devel, linux-mm, Cgroups,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 1 Aug 2013, Yan, Zheng wrote:
>> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> > From: Sha Zhengju <handai.szj@taobao.com>
>> >
>> > Following we will begin to add memcg dirty page accounting around
>> __set_page_dirty_
>> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> avoid exporting
>> > those details to filesystems.
>> >
>> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> > ---
>> >  fs/ceph/addr.c |   13 +------------
>> >  1 file changed, 1 insertion(+), 12 deletions(-)
>> >
>> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> > index 3e68ac1..1445bf1 100644
>> > --- a/fs/ceph/addr.c
>> > +++ b/fs/ceph/addr.c
>> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>> >         if (unlikely(!mapping))
>> >                 return !TestSetPageDirty(page);
>> >
>> > -       if (TestSetPageDirty(page)) {
>> > +       if (!__set_page_dirty_nobuffers(page)) {
>> it's too early to set the radix tree tag here. We should set page's snapshot
>> context and increase the i_wrbuffer_ref first. This is because once the tag
>> is set, writeback thread can find and start flushing the page.
>
> Unfortunately I only remember being frustrated by this code.  :)  Looking
> at it now, though, it seems like the minimum fix is to set the
> page->private before marking the page dirty.  I don't know the locking
> rules around that, though.  If that is potentially racy, maybe the safest
> thing would be if __set_page_dirty_nobuffers() took a void* to set
> page->private to atomically while holding the tree_lock.
>

Sorry, I didn't catch the point of your last sentence... Could you
please explain it again?

I notice there is a check in __set_page_dirty_nobuffers():
      WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
So does it mean we can only set page->private after it? But if so,
__mark_inode_dirty() would still run ahead of setting snapc.


Thanks,
Sha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-02  9:04       ` Sha Zhengju
  (?)
@ 2013-08-02 13:11       ` Yan, Zheng
  -1 siblings, 0 replies; 47+ messages in thread
From: Yan, Zheng @ 2013-08-02 13:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, ceph-devel, linux-mm, Cgroups, Sage Weil,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Fri, Aug 2, 2013 at 5:04 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>
> On Thu, Aug 1, 2013 at 11:19 PM, Yan, Zheng <ukernel@gmail.com> wrote:
> > On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Following we will begin to add memcg dirty page accounting around
> >> __set_page_dirty_
> >> {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> >> avoid exporting
> >> those details to filesystems.
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >> ---
> >>  fs/ceph/addr.c |   13 +------------
> >>  1 file changed, 1 insertion(+), 12 deletions(-)
> >>
> >> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> >> index 3e68ac1..1445bf1 100644
> >> --- a/fs/ceph/addr.c
> >> +++ b/fs/ceph/addr.c
> >> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >>         if (unlikely(!mapping))
> >>                 return !TestSetPageDirty(page);
> >>
> >> -       if (TestSetPageDirty(page)) {
> >> +       if (!__set_page_dirty_nobuffers(page)) {
> >
> > it's too early to set the radix tree tag here. We should set page's snapshot
> > context and increase the i_wrbuffer_ref first. This is because once the tag
> > is set, writeback thread can find and start flushing the page.
>
> OK, thanks for pointing it out.
>
> >
> >
> >>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >>                      mapping->host, page, page->index);
> >>                 return 0;
> >> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >>              snapc, snapc->seq, snapc->num_snaps);
> >>         spin_unlock(&ci->i_ceph_lock);
> >>
> >> -       /* now adjust page */
> >> -       spin_lock_irq(&mapping->tree_lock);
> >>         if (page->mapping) {    /* Race with truncate? */
> >> -               WARN_ON_ONCE(!PageUptodate(page));
> >> -               account_page_dirtied(page, page->mapping);
> >> -               radix_tree_tag_set(&mapping->page_tree,
> >> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> >> -
> >
> > this code was coped from __set_page_dirty_nobuffers(). I think the reason
> > Sage did this is to handle the race described in
> > __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> > NULL" can still happen here. Because truncate_inode_page() unmap page from
> > processes's address spaces first, then delete page from page cache.
>
> But in non-mmap case, doesn't it has no relation to 'unmap page from
> address spaces'?

In the non-mmap case, the page is locked when the set_page_dirty()
callback is called. truncate_inode_page() waits until the page is
unlocked, then deletes it from the page cache.

Regards
Yan, Zheng

> The check is exactly avoiding racy with delete_from_page_cache(),
> since the two both need to hold mapping->tree_lock, and if truncate
> goes first then __set_page_dirty_nobuffers() may have NULL mapping.
>
>
> Thanks,
> Sha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
@ 2013-08-02 20:30             ` Sage Weil
  0 siblings, 0 replies; 47+ messages in thread
From: Sage Weil @ 2013-08-02 20:30 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Yan, Zheng, linux-fsdevel, ceph-devel, linux-mm, Cgroups,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Fri, 2 Aug 2013, Sha Zhengju wrote:
> On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage@inktank.com> wrote:
> > On Thu, 1 Aug 2013, Yan, Zheng wrote:
> >> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> >> > From: Sha Zhengju <handai.szj@taobao.com>
> >> >
> >> > Following we will begin to add memcg dirty page accounting around
> >> __set_page_dirty_
> >> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> >> avoid exporting
> >> > those details to filesystems.
> >> >
> >> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >> > ---
> >> >  fs/ceph/addr.c |   13 +------------
> >> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >> >
> >> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> >> > index 3e68ac1..1445bf1 100644
> >> > --- a/fs/ceph/addr.c
> >> > +++ b/fs/ceph/addr.c
> >> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >> >         if (unlikely(!mapping))
> >> >                 return !TestSetPageDirty(page);
> >> >
> >> > -       if (TestSetPageDirty(page)) {
> >> > +       if (!__set_page_dirty_nobuffers(page)) {
> >> it's too early to set the radix tree tag here. We should set page's snapshot
> >> context and increase the i_wrbuffer_ref first. This is because once the tag
> >> is set, writeback thread can find and start flushing the page.
> >
> > Unfortunately I only remember being frustrated by this code.  :)  Looking
> > at it now, though, it seems like the minimum fix is to set the
> > page->private before marking the page dirty.  I don't know the locking
> > rules around that, though.  If that is potentially racy, maybe the safest
> > thing would be if __set_page_dirty_nobuffers() took a void* to set
> > page->private to atomically while holding the tree_lock.
> >
> 
> Sorry, I don't catch the point of your last sentence... Could you
> please explain it again?

It didn't make much sense.  :)  I was worried about multiple callers to 
set_page_dirty, but as I understand it, this all happens under the page 
lock, right?  (There is a mention of other special cases in 
mm/page-writeback.c, but I'm hoping we don't need to worry about that.)

In any case, I suspect what we actually want is something like the below 
(untested) patch.  The snapc accounting can be ignored here because 
invalidatepage will clean it up...

sage



diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index afb2fc2..7602e46 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,9 +76,10 @@ static int ceph_set_page_dirty(struct page *page)
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page)) {
+	if (PageDirty(page)) {
 		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 		     mapping->host, page, page->index);
+		BUG_ON(!PagePrivate(page));
 		return 0;
 	}
 
@@ -107,35 +108,16 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
-	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
-		/*
-		 * Reference snap context in page->private.  Also set
-		 * PagePrivate so that we get invalidatepage callback.
-		 */
-		page->private = (unsigned long)snapc;
-		SetPagePrivate(page);
-	} else {
-		dout("ANON set_page_dirty %p (raced truncate?)\n", page);
-		undo = 1;
-	}
-
-	spin_unlock_irq(&mapping->tree_lock);
-
-	if (undo)
-		/* whoops, we failed to dirty the page */
-		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
-
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	/*
+	 * Reference snap context in page->private.  Also set
+	 * PagePrivate so that we get invalidatepage callback.
+	 */
+	BUG_ON(PagePrivate(page));
+	page->private = (unsigned long)snapc;
+	SetPagePrivate(page);
 
-	BUG_ON(!PageDirty(page));
-	return 1;
+	return __set_page_dirty_nobuffers(page);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-02 20:30             ` Sage Weil
  (?)
@ 2013-08-03  8:58             ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-03  8:58 UTC (permalink / raw)
  To: Sage Weil
  Cc: Yan, Zheng, linux-fsdevel, ceph-devel, linux-mm, Cgroups,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Sat, Aug 3, 2013 at 4:30 AM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 2 Aug 2013, Sha Zhengju wrote:
>> On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage@inktank.com> wrote:
>> > On Thu, 1 Aug 2013, Yan, Zheng wrote:
>> >> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> >> > From: Sha Zhengju <handai.szj@taobao.com>
>> >> >
>> >> > Following we will begin to add memcg dirty page accounting around
>> >> __set_page_dirty_
>> >> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> >> avoid exporting
>> >> > those details to filesystems.
>> >> >
>> >> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> >> > ---
>> >> >  fs/ceph/addr.c |   13 +------------
>> >> >  1 file changed, 1 insertion(+), 12 deletions(-)
>> >> >
>> >> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> >> > index 3e68ac1..1445bf1 100644
>> >> > --- a/fs/ceph/addr.c
>> >> > +++ b/fs/ceph/addr.c
>> >> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>> >> >         if (unlikely(!mapping))
>> >> >                 return !TestSetPageDirty(page);
>> >> >
>> >> > -       if (TestSetPageDirty(page)) {
>> >> > +       if (!__set_page_dirty_nobuffers(page)) {
>> >> it's too early to set the radix tree tag here. We should set page's snapshot
>> >> context and increase the i_wrbuffer_ref first. This is because once the tag
>> >> is set, writeback thread can find and start flushing the page.
>> >
>> > Unfortunately I only remember being frustrated by this code.  :)  Looking
>> > at it now, though, it seems like the minimum fix is to set the
>> > page->private before marking the page dirty.  I don't know the locking
>> > rules around that, though.  If that is potentially racy, maybe the safest
>> > thing would be if __set_page_dirty_nobuffers() took a void* to set
>> > page->private to atomically while holding the tree_lock.
>> >
>>
>> Sorry, I don't catch the point of your last sentence... Could you
>> please explain it again?
>
> It didn't make much sense.  :)  I was worried about multiple callers to
> set_page_dirty, but as understand it, this all happens under page->lock,
> right?  (There is a mention of other special cases in mm/page-writeback.c,
> but I'm hoping we don't need to worry about that.)

I agree, the page lock can handle the concurrent access.

>
> In any case, I suspect what we actually want is something like the below
> (untested) patch.  The snapc accounting can be ignored here because
> invalidatepage will clean it up...
>
> sage
>
>
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index afb2fc2..7602e46 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -76,9 +76,10 @@ static int ceph_set_page_dirty(struct page *page)
>         if (unlikely(!mapping))
>                 return !TestSetPageDirty(page);
>
> -       if (TestSetPageDirty(page)) {
> +       if (PageDirty(page)) {
>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>                      mapping->host, page, page->index);
> +               BUG_ON(!PagePrivate(page));
>                 return 0;
>         }
>
> @@ -107,35 +108,16 @@ static int ceph_set_page_dirty(struct page *page)
>              snapc, snapc->seq, snapc->num_snaps);
>         spin_unlock(&ci->i_ceph_lock);
>
> -       /* now adjust page */
> -       spin_lock_irq(&mapping->tree_lock);
> -       if (page->mapping) {    /* Race with truncate? */
> -               WARN_ON_ONCE(!PageUptodate(page));
> -               account_page_dirtied(page, page->mapping);
> -               radix_tree_tag_set(&mapping->page_tree,
> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> -
> -               /*
> -                * Reference snap context in page->private.  Also set
> -                * PagePrivate so that we get invalidatepage callback.
> -                */
> -               page->private = (unsigned long)snapc;
> -               SetPagePrivate(page);
> -       } else {
> -               dout("ANON set_page_dirty %p (raced truncate?)\n", page);
> -               undo = 1;
> -       }
> -
> -       spin_unlock_irq(&mapping->tree_lock);
> -
> -       if (undo)
> -               /* whoops, we failed to dirty the page */
> -               ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> -
> -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +       /*
> +        * Reference snap context in page->private.  Also set
> +        * PagePrivate so that we get invalidatepage callback.
> +        */
> +       BUG_ON(PagePrivate(page));
> +       page->private = (unsigned long)snapc;
> +       SetPagePrivate(page);
>
> -       BUG_ON(!PageDirty(page));
> -       return 1;
> +       return __set_page_dirty_nobuffers(page);
>  }
>
>  /*

Looks good. Since the page lock prevents concurrent access, the undo
logic is no longer necessary either. Thank you very much!

-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
@ 2013-08-03  9:25         ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-03  9:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-fsdevel, linux-mm, Cgroups, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
>> rule still is:
>>         mem_cgroup_begin_update_page_stat()
>>         modify page WRITEBACK stat
>>         mem_cgroup_update_page_stat()
>>         mem_cgroup_end_update_page_stat()
>>
>> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
>> Lock order:
>>       --> memcg->move_lock
>>         --> mapping->tree_lock
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> Looks good to me. Maybe I would suggest moving this patch up the stack
> so that it might get merged earlier as it is simpler than dirty pages
> accounting. Unless you insist on having the full series merged at once.

I think the following three patches can be merged earlier:
      1/8 memcg: remove MEMCG_NR_FILE_MAPPED
      3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
      5/8 memcg: add per cgroup writeback pages accounting

Do I need to resend them for you, or are they enough as is?

One more thing: since dirty accounting is essential to future memcg
dirty page throttling and is no longer an optional feature, I wonder
whether we can merge the following two as well and leave the overhead
optimization to a separate series.  :p
      4/8 memcg: add per cgroup dirty pages accounting
      8/8 memcg: Document cgroup dirty/writeback memory statistics

The 2/8 ceph one still needs more improvement; I'll separate it out in the next version.

>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thank you.

>
>> ---
>>  include/linux/memcontrol.h |    1 +
>>  mm/memcontrol.c            |    5 +++++
>>  mm/page-writeback.c        |   15 +++++++++++++++
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index f952be6..ccd35d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_RSS_HUGE,       /* # of pages charged as anon huge */
>>       MEM_CGROUP_STAT_FILE_MAPPED,    /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_FILE_DIRTY,     /* # of dirty pages in page cache */
>> +     MEM_CGROUP_STAT_WRITEBACK,      /* # of pages under writeback */
>>       MEM_CGROUP_STAT_SWAP,           /* # of pages, swapped out */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8f3e514..6c18a6d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "rss_huge",
>>       "mapped_file",
>>       "dirty",
>> +     "writeback",
>>       "swap",
>>  };
>>
>> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>>               mem_cgroup_move_account_page_stat(from, to, nr_pages,
>>                       MEM_CGROUP_STAT_FILE_DIRTY);
>>
>> +     if (PageWriteback(page))
>> +             mem_cgroup_move_account_page_stat(from, to, nr_pages,
>> +                     MEM_CGROUP_STAT_WRITEBACK);
>> +
>>       mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>>
>>       /* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index a09f518..2fa6a52 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>>   * Helper function for set_page_writeback family.
>> + *
>> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
>> + * while calling this function.
>> + * See test_set_page_writeback for example.
>> + *
>>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>>   * wrt interrupts.
>>   */
>>  void account_page_writeback(struct page *page)
>>  {
>> +     mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>       inc_zone_page_state(page, NR_WRITEBACK);
>>  }
>>  EXPORT_SYMBOL(account_page_writeback);
>> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long memcg_flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>>               ret = TestClearPageWriteback(page);
>>       }
>>       if (ret) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>               dec_zone_page_state(page, NR_WRITEBACK);
>>               inc_zone_page_state(page, NR_WRITTEN);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>>       return ret;
>>  }
>>
>> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>>       }
>>       if (!ret)
>>               account_page_writeback(page);
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       return ret;
>>
>>  }
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Michal Hocko
> SUSE Labs



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 0/8] Add memcg dirty/writeback page accounting
  2013-08-01 14:43 ` [PATCH V5 0/8] Add memcg dirty/writeback page accounting Michal Hocko
@ 2013-08-03  9:30   ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-03  9:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cgroups, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Thu, Aug 1, 2013 at 10:43 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 01-08-13 19:43:22, Sha Zhengju wrote:
> [...]
>> Some performance numbers obtained with Mel's pft test (on a 4G memory and
>> 4-core i5 CPU machine):
>
> I am a little bit confused about what this testcase is actually
> testing... AFAIU it produces a lot of page faults, but they are all
> anonymous and very short-lived. So neither dirty nor writeback
> accounting is done.
>
> I would have expected a testcase which generates a lot of IO.

I see. I've never been good at testing. :(

>
> Also, as a general note: it would be better to mention the number of
> runs and the standard deviation so that we have an idea of the
> variability of the load.

OK. Thanks for the notes!

>
>> vanilla  : memcg enabled, patch not applied
>> patched  : all patches applied
>>
>> * Duration numbers:
>>              vanilla     patched
>> User          385.38      379.47
>> System         65.12       66.46
>> Elapsed       457.46      452.21
>>
>> * Summary numbers:
>> vanilla:
>> Clients User        System      Elapsed     Faults/cpu  Faults/sec
>> 1       0.03        0.18        0.21        931682.645  910993.850
>> 2       0.03        0.22        0.13        760431.152  1472985.863
>> 3       0.03        0.29        0.12        600495.043  1620311.084
>> 4       0.04        0.37        0.12        475380.531  1688013.267
>>
>> patched:
>> Clients User        System      Elapsed     Faults/cpu  Faults/sec
>> 1       0.02        0.19        0.22        915362.875  898763.732
>> 2       0.03        0.23        0.13        757518.387  1464893.996
>> 3       0.03        0.30        0.12        592113.126  1611873.469
>> 4       0.04        0.38        0.12        472203.393  1680013.271
>>
>> We can see the performance gap is minor.
> [...]
> --
> Michal Hocko
> SUSE Labs



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
@ 2013-08-04 10:08             ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2013-08-04 10:08 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, Cgroups, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Sat 03-08-13 17:25:01, Sha Zhengju wrote:
> On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> >> rule still is:
> >>         mem_cgroup_begin_update_page_stat()
> >>         modify page WRITEBACK stat
> >>         mem_cgroup_update_page_stat()
> >>         mem_cgroup_end_update_page_stat()
> >>
> >> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> >> Lock order:
> >>       --> memcg->move_lock
> >>         --> mapping->tree_lock
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >
> > Looks good to me. Maybe I would suggest moving this patch up the stack
> > so that it might get merged earlier as it is simpler than dirty pages
> > accounting. Unless you insist on having the full series merged at once.
> 
> I think the following three patches can be merged earlier:
>       1/8 memcg: remove MEMCG_NR_FILE_MAPPED
>       3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
>       5/8 memcg: add per cgroup writeback pages accounting
> 
> Do I need to resend them for you, or are they enough as is?

This is a question for Andrew. I would go with them as they are.
 
> One more thing: since dirty accounting is essential to future memcg
> dirty page throttling and is no longer an optional feature, I wonder
> whether we can merge the following two as well and leave the overhead
> optimization to a separate series.  :p

I wouldn't hurry it. We need numbers for serious testing to see the
overhead. It is still just a small step towards dirty throttling.

>       4/8 memcg: add per cgroup dirty pages accounting
>       8/8 memcg: Document cgroup dirty/writeback memory statistics
> 
> The 2/8 ceph one still needs more improvement; I'll separate it out in the next version.
> 
> >
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thank you.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 3/8] memcg: check for proper lock held in mem_cgroup_update_page_stat
  2013-08-01 11:52 ` [PATCH V5 3/8] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
@ 2013-08-04 18:48     ` Greg Thelen
  0 siblings, 0 replies; 47+ messages in thread
From: Greg Thelen @ 2013-08-04 18:48 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, mhocko, kamezawa.hiroyu, glommer,
	fengguang.wu, akpm, Sha Zhengju

On Thu, Aug 01 2013, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> We should call mem_cgroup_begin_update_page_stat() before
> mem_cgroup_update_page_stat() to get proper locks, however the
> latter doesn't do any checking that we use proper locking, which
> would be hard. Suggested by Michal Hock we could at least test for
                                     ^^ Hocko
> rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled().
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Reviewed-by: Greg Thelen <gthelen@google.com>

> ---
>  mm/memcontrol.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7691cef..4a55d46 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2301,6 +2301,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	if (mem_cgroup_disabled())
>  		return;
>  
> +	VM_BUG_ON(!rcu_read_lock_held());
>  	memcg = pc->mem_cgroup;
>  	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>  		return;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
@ 2013-08-04 18:51     ` Greg Thelen
  0 siblings, 0 replies; 47+ messages in thread
From: Greg Thelen @ 2013-08-04 18:51 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, mhocko, kamezawa.hiroyu,
	glommer, fengguang.wu, akpm, Sha Zhengju

On Thu, Aug 01 2013, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
>
> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> Lock order:
> 	--> memcg->move_lock
> 	  --> mapping->tree_lock
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   15 +++++++++++++++
>  3 files changed, 21 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f952be6..ccd35d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8f3e514..6c18a6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss_huge",
>  	"mapped_file",
>  	"dirty",
> +	"writeback",
>  	"swap",
>  };
>  
> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a09f518..2fa6a52 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while calling this function.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;

I recommend using memcg_flags here, which matches the pattern in
test_clear_page_writeback above, to avoid colliding with the nested
'flags' variable below.

Otherwise,
Reviewed-by: Greg Thelen <gthelen@google.com>

>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists
@ 2013-08-05 21:58         ` Johannes Weiner
  0 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2013-08-05 21:58 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, Cgroups, Michal Hocko, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Fri, Aug 02, 2013 at 12:32:17PM +0800, Sha Zhengju wrote:
> On Fri, Aug 2, 2013 at 12:20 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Aug 01, 2013 at 08:00:07PM +0800, Sha Zhengju wrote:
> >> @@ -6303,6 +6360,49 @@ mem_cgroup_css_online(struct cgroup *cont)
> >>       }
> >>
> >>       error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> >> +     if (!error) {
> >> +             if (!mem_cgroup_in_use()) {
> >> +                     /* I'm the first non-root memcg; move global stats to the root.
> >> +                      * Memcg creation is serialized by cgroup locks (cgroup_mutex),
> >> +                      * so the mem_cgroup_in_use() check is safe.
> >> +                      *
> >> +                      * We use global_page_state() to read the global page stats, but
> >> +                      * because of the optimized inc/dec functions used on SMP while
> >> +                      * updating each zone's stats, we may lose some counts held in a
> >> +                      * stock (zone->pageset->vm_stat_diff), which brings some
> >> +                      * inaccuracy. But the places where the kernel uses these page
> >> +                      * stats to steer its next decision, e.g. dirty page throttling
> >> +                      * or writeback, also use global_page_state(), so it's accurate
> >> +                      * enough here too.
> >> +                      */
> >> +                     spin_lock(&root_mem_cgroup->pcp_counter_lock);
> >> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> >> +                                             global_page_state(NR_FILE_MAPPED);
> >> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_FILE_DIRTY] =
> >> +                                             global_page_state(NR_FILE_DIRTY);
> >> +                     root_mem_cgroup->stats_base.count[MEM_CGROUP_STAT_WRITEBACK] =
> >> +                                             global_page_state(NR_WRITEBACK);
> >> +                     spin_unlock(&root_mem_cgroup->pcp_counter_lock);
> >> +             }
> >
> > If inaccuracies in these counters are okay, why do we go through an
> > elaborate locking scheme that sprinkles memcg callbacks everywhere
> > just to be 100% reliable in the rare case somebody moves memory
> > between cgroups?
> 
> IMO they are not the same thing. Moving between cgroups may happen
> many times; if we ignore the inaccuracy on each move, the cumulative
> error caused by this back-and-forth behavior can become very large.
> 
> But the transfer above occurs only once, when the first non-root memcg
> is created, so the error is at most zone->pageset->stat_threshold per
> cpu. This threshold depends on the amount of zone memory and the number
> of processors, and its maximum value is 125 (e.g. with 4 CPUs that is
> at most about 4 * 125 = 500 pages per counter, roughly 2MB of 4KiB
> pages), so I thought using global_page_state() is enough. Of course we
> could also fold in the per-cpu stock for greater precision, but that is
> itself racy and can be inaccurate too.

File pages may get unmapped/dirtied/put under writeback/finish
writeback between reading the stats and arming the inuse keys (before
which you are not collecting any percpu deltas).

The error is not from the percpu inaccuracies but because you don't
snapshot the counters and start accounting changes atomically wrt
ongoing counter modifications.  This means the error is unbounded.
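
Schematically, the window looks like this (an illustrative timeline,
not quoted from the patch):

    css_online (first non-root memcg)     concurrent writeback path
    ---------------------------------     -------------------------
    read global_page_state(NR_WRITEBACK)
                                          test_set_page_writeback()
                                            (in-use key still off, so
                                             the memcg counter misses
                                             this page)
    copy snapshot into root memcg
    arm the in-use key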

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Fwd: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-04 10:08             ` Michal Hocko
@ 2013-08-22  9:46               ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Greg Thelen, linux-mm, Cgroups,
	linux-fsdevel, Sha Zhengju

Hi Andrew,

After several rounds of review, parts of the memcg page accounting
patchset are ready (leaving memcg dirty page accounting under further
review):

  1/8 memcg: remove MEMCG_NR_FILE_MAPPED
  2/8 ceph: vfs __set_page_dirty_nobuffers interface instead of doing
it inside filesystem
  3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
  5/8 memcg: add per cgroup writeback pages accounting
  8/8 memcg: Document cgroup dirty/writeback memory statistics

But the 2/8 ceph one has been improved again and will be merged via
the ceph tree, so only the other 4 patches need to go into the -mm
tree. I've moved the 5/8 writeback one up the stack and updated the
8/8 to document only the writeback changes. Could you please merge
them earlier? I'll post these updated patches soon. :)

Thank you!


---------- Forwarded message ----------
From: Michal Hocko <mhocko@suse.cz>
Date: Sun, Aug 4, 2013 at 6:08 PM
Subject: Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
To: Sha Zhengju <handai.szj@gmail.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>, Cgroups
<cgroups@vger.kernel.org>, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com>, Glauber Costa <glommer@gmail.com>,
Greg Thelen <gthelen@google.com>, Wu Fengguang
<fengguang.wu@intel.com>, Andrew Morton <akpm@linux-foundation.org>,
Sha Zhengju <handai.szj@taobao.com>


On Sat 03-08-13 17:25:01, Sha Zhengju wrote:
> On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> >> rule still is:
> >>         mem_cgroup_begin_update_page_stat()
> >>         modify page WRITEBACK stat
> >>         mem_cgroup_update_page_stat()
> >>         mem_cgroup_end_update_page_stat()
> >>
> >> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> >> Lock order:
> >>       --> memcg->move_lock
> >>         --> mapping->tree_lock
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >
> > Looks good to me. Maybe I would suggest moving this patch up the stack
> > so that it might get merged earlier as it is simpler than dirty pages
> > accounting. Unless you insist on having the full series merged at once.
>
> I think the following three patches can be merged earlier:
>       1/8 memcg: remove MEMCG_NR_FILE_MAPPED
>       3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
>       5/8 memcg: add per cgroup writeback pages accounting
>
> Do I need to resend them for you, or are they enough as is?

This is a question for Andrew. I would go with them as they are.

> One more thing: since dirty accounting is essential to future memcg
> dirty page throttling and is no longer an optional feature, I wonder
> whether we can merge the following two as well and leave the overhead
> optimization to a separate series.  :p

I wouldn't hurry it. We need numbers for serious testing to see the
overhead. It is still just a small step towards dirty throttling.

>       4/8 memcg: add per cgroup dirty pages accounting
>       8/8 memcg: Document cgroup dirty/writeback memory statistics
>
> The 2/8 ceph one still needs more improvement; I'll separate it out in the next version.
>
> >
> > Acked-by: Michal Hocko <mhocko@suse.cz>
>
> Thank you.
[...]
--
Michal Hocko
SUSE Labs


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-08-22  9:46               ` Sha Zhengju
@ 2013-08-22  9:50               ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:50 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
as an extra layer of indirection, given the complexity and presumed
performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |   27 +++++++++++++++++++--------
 mm/memcontrol.c            |   25 +------------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8a9ed4d..630bfa1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,9 +30,20 @@ struct page;
 struct mm_struct;
 struct kmem_cache;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
+ * These two lists should keep in accord with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
+	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -233,17 +244,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -448,12 +459,12 @@ static inline bool mem_cgroup_oom_synchronize(void)
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b73988a..24d6d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -84,21 +84,6 @@ static int really_do_swap_account __initdata = 0;
 #endif
 
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
-	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -2255,7 +2240,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -2268,14 +2253,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 07afd48..373da2a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
 				hpage_nr_pages(page));
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat
  2013-08-22  9:46               ` Sha Zhengju
@ 2013-08-22  9:52               ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:52 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to take the proper locks; however, the
latter doesn't do any checking that we use proper locking, which
would be hard. As suggested by Michal Hocko, we can at least test for
rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 mm/memcontrol.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 24d6d02..0a50871 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2249,6 +2249,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (mem_cgroup_disabled())
 		return;
 
+	VM_BUG_ON(!rcu_read_lock_held());
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22  9:46               ` Sha Zhengju
                                 ` (2 preceding siblings ...)
@ 2013-08-22  9:53               ` Sha Zhengju
  2013-08-22 22:40                 ` Andrew Morton
  -1 siblings, 1 reply; 47+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:53 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count writeback pages; dirty pages
will be accounted later as well.

After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
use the 'struct page' flags to test page state instead of per-page_cgroup flags. But
memcg has a feature to move a page from one cgroup to another, which can race with
"page stat accounting". So in order to avoid the race we have designed a new locking
scheme:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (the writeback page accounting) to be
protected by mem_cgroup_{begin/end}_update_page_stat(). This is a full
no-op for !CONFIG_MEMCG, an almost-no-op if memcg is disabled (but
compiled in), an rcu read lock in the common case (no task is moving),
and a spin_lock_irqsave on top in the slow path.

There are two writeback interfaces to modify:
test_{clear,set}_page_writeback(). The lock order is as follows (see
the usage sketch below):
	--> memcg->move_lock
	  --> mapping->tree_lock
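
As a usage sketch (a hypothetical helper, not part of this patch; the
real callers are in the mm/page-writeback.c hunk below), the protocol
brackets the page flag change (a) and the stat update (b):

	/* hypothetical example for illustration only */
	static void example_set_writeback(struct page *page)
	{
		bool locked;
		unsigned long flags;

		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
		if (!TestSetPageWriteback(page))	/* (a) */
			mem_cgroup_inc_page_stat(page,	/* (b) */
					MEM_CGROUP_STAT_WRITEBACK);
		mem_cgroup_end_update_page_stat(page, &locked, &flags);
	}

Any mapping->tree_lock taken inside that bracket then nests under
memcg->move_lock, matching the order above.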

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   15 +++++++++++++++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 630bfa1..1878644 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a50871..869bfaf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -89,6 +89,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"writeback",
 	"swap",
 };
 
@@ -3675,6 +3676,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < (long)nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3720,13 +3735,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ddc56fc..b6ecbc6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1999,11 +1999,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold the lock taken by
+ * mem_cgroup_begin_update_page_stat() while calling this function;
+ * see test_set_page_writeback() for an example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2220,7 +2226,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2241,9 +2250,11 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2251,7 +2262,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2278,6 +2292,7 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 
 }
-- 
1.7.9.5


* [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics
  2013-08-22  9:46               ` Sha Zhengju
                                 ` (3 preceding siblings ...)
@ 2013-08-22  9:53               ` Sha Zhengju
  -1 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:53 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 327acec..c9ea502 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -490,6 +490,7 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+writeback	- # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


* Re: [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
@ 2013-08-22 22:40                 ` Andrew Morton
  2013-08-23 16:11                     ` Sha Zhengju
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2013-08-22 22:40 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

On Thu, 22 Aug 2013 17:53:10 +0800 Sha Zhengju <handai.szj@gmail.com> wrote:

> This patch adds memcg routines to count writeback pages

Well OK, but why?  What use is the feature?  In what ways are people
suffering due to its absence?


* Re: [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22 22:40                 ` Andrew Morton
@ 2013-08-23 16:11                     ` Sha Zhengju
  0 siblings, 0 replies; 47+ messages in thread
From: Sha Zhengju @ 2013-08-23 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Greg Thelen, linux-mm, Cgroups,
	linux-fsdevel, Sha Zhengju

On Fri, Aug 23, 2013 at 6:40 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 22 Aug 2013 17:53:10 +0800 Sha Zhengju <handai.szj@gmail.com> wrote:
>
>> This patch adds memcg routines to count writeback pages
>
> Well OK, but why?  What use is the feature?  In what ways are people
> suffering due to its absence?

My apologies for not explaining it clearly.

It's a subset of memcg dirty page accounting (covering dirty,
writeback, and nfs_unstable pages in the broad sense), which gives a
sounder picture of memcg behavior. It would also make it
straightforward to add new features like memcg dirty page throttling
and even memcg-aware flushing.
However, the dirty-page part is more complicated and performance
sensitive, so it needs more work; I'd like to let the writeback patch
go first and then focus on dirty pages themselves.

-- 
Thanks,
Sha




Thread overview: 47+ messages
2013-08-01 11:43 [PATCH V5 0/8] Add memcg dirty/writeback page accounting Sha Zhengju
2013-08-01 11:43 ` Sha Zhengju
2013-08-01 11:44 ` [PATCH 1/8] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
2013-08-01 15:19   ` Yan, Zheng
2013-08-01 18:27     ` Sage Weil
2013-08-02 10:04       ` Sha Zhengju
     [not found]         ` <CAFj3OHVXvtr5BDMrGatHZi7M9y+dh1ZKRMQZGjZmNBcg3pNQtw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-02 20:30           ` Sage Weil
2013-08-02 20:30             ` Sage Weil
2013-08-03  8:58             ` Sha Zhengju
2013-08-02  9:04     ` Sha Zhengju
2013-08-02  9:04       ` Sha Zhengju
2013-08-02 13:11       ` Yan, Zheng
2013-08-01 11:52 ` [PATCH V5 3/8] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
2013-08-01 14:34   ` Michal Hocko
2013-08-01 14:34     ` Michal Hocko
2013-08-04 18:48   ` Greg Thelen
2013-08-04 18:48     ` Greg Thelen
2013-08-01 11:53 ` [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
2013-08-01 14:53   ` Michal Hocko
     [not found]     ` <20130801145302.GJ5198-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-08-03  9:25       ` Sha Zhengju
2013-08-03  9:25         ` Sha Zhengju
     [not found]         ` <CAFj3OHV-VCKJfe6bv4UMvv+uj4LELDXsieRZFJD06Yrdyy=XxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-04 10:08           ` Michal Hocko
2013-08-04 10:08             ` Michal Hocko
2013-08-22  9:46             ` Fwd: " Sha Zhengju
2013-08-22  9:46               ` Sha Zhengju
2013-08-22  9:50               ` [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2013-08-22  9:52               ` [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
2013-08-22 22:40                 ` Andrew Morton
2013-08-23 16:11                   ` Sha Zhengju
2013-08-23 16:11                     ` Sha Zhengju
2013-08-22  9:53               ` [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2013-08-04 18:51   ` [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting Greg Thelen
2013-08-04 18:51     ` Greg Thelen
2013-08-01 11:55 ` [PATCH V5 6/8] memcg: make nocpu_base available for non-hotplug Sha Zhengju
2013-08-01 12:00 ` [PATCH V5 7/8] memcg: don't account root memcg page stats if only root exists Sha Zhengju
2013-08-01 16:20   ` Johannes Weiner
2013-08-02  4:32     ` Sha Zhengju
2013-08-02  4:32       ` Sha Zhengju
2013-08-05 21:58       ` Johannes Weiner
2013-08-05 21:58         ` Johannes Weiner
2013-08-01 12:00 ` [PATCH V5 8/8] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2013-08-01 12:00   ` Sha Zhengju
2013-08-01 14:43 ` [PATCH V5 0/8] Add memcg dirty/writeback page accounting Michal Hocko
2013-08-03  9:30   ` Sha Zhengju
