* [PATCH V4 0/6] Memcg dirty/writeback page accounting
@ 2013-07-05 17:18 Sha Zhengju
  2013-07-05 17:21 ` [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:18 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

Hi, list

This is the V4 patch series that provides the ability for each memory cgroup to
have independent dirty/writeback page statistics.
Previous versions are here:
  V3 - http://lwn.net/Articles/530776/;
  V2 - http://lwn.net/Articles/508452/;
  V1 - http://lwn.net/Articles/504017/;

The first two patches do some cleanup and preparatory work. Compared to V3, we give
up reworking the vfs set_page_dirty routines. Though this makes the following memcg
dirty page accounting a little more complicated, it avoids exporting a new symbol and
makes the vfs review easier. Patches 3/6 and 4/6 do the actual memcg dirty and
writeback page accounting, which adds statistics code in several hot paths. So in order
to reduce the overhead, patch 5/6 tries to optimize with a jump label: if only the root
memcg exists, there is no possibility of tasks moving between memcgs, so we can patch
some code out when it is not used. Note that we only patch out
mem_cgroup_{begin,end}_update_page_stat() in that case, and still leave
mem_cgroup_update_page_stat() in place because we still need the root page stats.
Roughly, the fast path added by 5/6 looks like the sketch below (condensed from that
patch, not complete code):
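
    /* Condensed from patch 5/6: memcg_inuse_key is a jump label that is armed
     * only while some non-root memcg exists, so the common root-only case takes
     * a patched-out branch and skips the move-lock protocol entirely. */
    extern struct static_key memcg_inuse_key;

    static inline bool mem_cgroup_in_use(void)
    {
            return static_key_false(&memcg_inuse_key);
    }

    static inline void mem_cgroup_begin_update_page_stat(struct page *page,
                                            bool *locked, unsigned long *flags)
    {
            if (mem_cgroup_disabled())
                    return;
            if (!mem_cgroup_in_use())       /* only root memcg: nothing to do */
                    return;
            /* ... rcu_read_lock() and, if a move is in flight,
             * move_lock_mem_cgroup() ... */
    }
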
There still seems to be some impact on performance; see the numbers obtained with
Mel's pft test (on a machine with 4GB of memory and a 4-core i5 CPU):

vanilla  : memcg enabled, patches not applied
account  : memcg enabled, with dirty/writeback page accounting added
patched  : all patches applied, including the optimization in 5/6

* Duration numbers:
             vanilla     account
User          395.02      456.68
System         65.76       69.27
Elapsed       465.38      531.74(+14.3%)

             vanilla     patched
User          395.02      411.57
System         65.76       65.82
Elapsed       465.38      481.14(+3.4%)
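(Percentages are relative to the vanilla elapsed time; for example, the patched
overhead is (481.14 - 465.38) / 465.38 ~= +3.4%.)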

* Summary numbers:
vanilla:
Clients User        System      Elapsed     Faults/cpu  Faults/sec  
1       0.02        0.19        0.22        925593.267  905793.990  
2       0.03        0.24        0.14        738464.210  1429302.471 
3       0.04        0.29        0.12        589251.166  1596685.658 
4       0.04        0.38        0.12        472565.657  1694854.626

account:
Clients User        System      Elapsed     Faults/cpu  Faults/sec  
1       0.02        0.20        0.23        878176.374  863037.483 (-4.7%)
2       0.03        0.27        0.16        661140.331  1235796.314(-13.5%)
3       0.03        0.30        0.16        592960.401  1225448.132(-23.3%)
4       0.04        0.31        0.15        567897.251  1296703.568(-23.5%)

patched:
Clients User        System      Elapsed     Faults/cpu  Faults/sec  
1       0.02        0.19        0.22        912709.796  898977.675 (-0.8%)
2       0.03        0.24        0.14        710928.981  1380878.891(-3.4%)
3       0.03        0.30        0.12        584247.648  1571436.530(-1.6%)
4       0.03        0.38        0.12        470335.271  1679938.339(-0.9%)

The performance is lower than the vanilla kernel, but I think the difference is minor.
Note that I saw some fluctuation in these numbers over several rounds; I selected the
figures above as representative of the majority of runs, so I don't know if I missed
anything. I'm not sure whether this test case is sufficient, so any advice is welcome! :)


Change log:
v4 <-- v3:
	1. give up reworking vfs codes
	2. change lock order of memcg->move_lock and mapping->tree_lock
	3. patch out mem_cgroup_{begin,end}_update_page_stat when not used
	4. rebased to since-3.10 branch
v3 <-- v2:
	1. change lock order of mapping->tree_lock and memcg->move_lock
	2. performance optimization in 6/8 and 7/8
v2 <-- v1:
        1. add test numbers
        2. some small fixes and comments

Sha Zhengju (6):
      memcg: remove MEMCG_NR_FILE_MAPPED
      fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
      memcg: add per cgroup dirty pages accounting
      memcg: add per cgroup writeback pages accounting
      memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
      memcg: Document cgroup dirty/writeback memory statistics

 Documentation/cgroups/memory.txt |    2 +
 fs/buffer.c                      |    9 +++++
 fs/ceph/addr.c                   |   13 +-----
 include/linux/memcontrol.h       |   44 ++++++++++++++++----
 mm/filemap.c                     |   14 +++++++
 mm/memcontrol.c                  |   83 +++++++++++++++++++++++---------------
 mm/page-writeback.c              |   39 +++++++++++++++++-
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    6 +++
 9 files changed, 158 insertions(+), 56 deletions(-)



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
@ 2013-07-05 17:21 ` Sha Zhengju
  2013-07-11 13:39   ` Michal Hocko
  2013-07-05 17:26 ` [PATCH V4 2/6] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:21 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

When accounting memcg page stats, it is not worth using MEMCG_NR_FILE_MAPPED
as an extra layer of indirection, given the complexity and presumed
performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/memcontrol.h |   27 +++++++++++++++++++--------
 mm/memcontrol.c            |   25 +------------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..d166aeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,9 +30,20 @@ struct page;
 struct mm_struct;
 struct kmem_cache;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
+ * These two lists should keep in accord with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
+	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -165,17 +176,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -349,12 +360,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e120e4..f9acf49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,21 +85,6 @@ static int really_do_swap_account __initdata = 0;
 #endif
 
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
-	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -2307,7 +2292,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -2320,14 +2305,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..3a3e03e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 2/6] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
  2013-07-05 17:21 ` [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
@ 2013-07-05 17:26 ` Sha Zhengju
       [not found] ` <1373044710-27371-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:26 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel, linux-mm, cgroups
  Cc: sage, mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu,
	mgorman, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

In the following patches we will begin to add memcg dirty page accounting around
__set_page_dirty_{buffers,nobuffers} in the VFS layer, so we'd better use the VFS
interface to avoid exporting those details to the filesystems.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
cc: Sage Weil <sage@inktank.com>
cc: Michal Hocko <mhocko@suse.cz>
cc: Greg Thelen <gthelen@google.com>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>

---
 fs/ceph/addr.c |   13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..1445bf1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page)) {
+	if (!__set_page_dirty_nobuffers(page)) {
 		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 		     mapping->host, page, page->index);
 		return 0;
@@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 3/6] memcg: add per cgroup dirty pages accounting
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
@ 2013-07-05 17:30     ` Sha Zhengju
  2013-07-05 17:26 ` [PATCH V4 2/6] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory controller
to maintain an accurate view of the amount of its dirty memory.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we can
use the 'struct page' flags to test page state instead of a per-page_cgroup flag. But memcg
has a feature to move a page from one cgroup to another, and so there may be a race between
"move" and "page stat accounting". So in order to avoid the race we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()
It requires both (a) and (b) (dirty page accounting) to be protected by
mem_cgroup_{begin/end}_update_page_stat().
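
For a concrete picture, the cancel_dirty_page() hunk below (mm/truncate.c) ends up
following the protocol like this (condensed):

        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        if (TestClearPageDirty(page)) {                   /* (a) modify page state */
                mem_cgroup_dec_page_stat(page,
                                MEM_CGROUP_STAT_FILE_DIRTY); /* (b) account */
                dec_zone_page_state(page, NR_FILE_DIRTY);
                ...
        }
        mem_cgroup_end_update_page_stat(page, &locked, &flags);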

Server places should be added accounting:
        incrementing (3):
                __set_page_dirty_buffers
                __set_page_dirty_nobuffers
		mark_buffer_dirty
        decrementing (5):
                clear_page_dirty_for_io
                cancel_dirty_page
		delete_from_page_cache
		__delete_from_page_cache
		replace_page_cache_page

The lock order between memcg lock and mapping lock is:
	--> memcg->move_lock
	  --> mapping->private_lock
            --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
cc: Michal Hocko <mhocko@suse.cz>
cc: Greg Thelen <gthelen@google.com>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>
---
 fs/buffer.c                |    9 +++++++++
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   14 ++++++++++++++
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   24 ++++++++++++++++++++++--
 mm/truncate.c              |    6 ++++++
 6 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 695eb14..7c537f4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -694,10 +694,13 @@ int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -713,6 +716,7 @@ int __set_page_dirty_buffers(struct page *page)
 
 	if (newly_dirty)
 		__set_page_dirty(page, mapping, 1);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1169,11 +1173,16 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		bool locked;
+		unsigned long flags;
+
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (!TestSetPageDirty(page)) {
 			struct address_space *mapping = page_mapping(page);
 			if (mapping)
 				__set_page_dirty(page, mapping, 0);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d166aeb..f952be6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b51ac1..5642de6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -65,6 +65,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *  ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *						move_lock_mem_cgroup)
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *        ->mapping->tree_lock
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -144,6 +149,7 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
@@ -161,13 +167,17 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	bool locked;
+	unsigned long flags;
 
 	BUG_ON(!PageLocked(page));
 
 	freepage = mapping->a_ops->freepage;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
@@ -417,6 +427,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		bool locked;
+		unsigned long flags;
 
 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
@@ -425,6 +437,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
+		mem_cgroup_begin_update_page_stat(old, &locked, &flags);
 		spin_lock_irq(&mapping->tree_lock);
 		__delete_from_page_cache(old);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
@@ -434,6 +447,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(old, &locked, &flags);
 		/* mem_cgroup codes must not be called under tree_lock */
 		mem_cgroup_replace_page_cache(old, new);
 		radix_tree_preload_end();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f9acf49..1d31851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss_huge",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -3743,6 +3744,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3788,13 +3803,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (!anon && PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..3900e62 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty_{nobuffers,buffers} for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
@@ -1989,6 +1994,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 	trace_writeback_dirty_page(page, mapping);
 
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2028,6 +2034,11 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
@@ -2045,12 +2056,15 @@ int __set_page_dirty_nobuffers(struct page *page)
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2166,6 +2180,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2207,13 +2224,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..9c9aa03 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 3/6] memcg: add per cgroup dirty pages accounting
@ 2013-07-05 17:30     ` Sha Zhengju
  0 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:30 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory controller
to maintain an accurate view of the amount of its dirty memory.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we can
use the 'struct page' flags to test page state instead of a per-page_cgroup flag. But memcg
has a feature to move a page from one cgroup to another, and so there may be a race between
"move" and "page stat accounting". So in order to avoid the race we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()
It requires both (a) and (b) (dirty page accounting) to be protected by
mem_cgroup_{begin/end}_update_page_stat().

Server places should be added accounting:
        incrementing (3):
                __set_page_dirty_buffers
                __set_page_dirty_nobuffers
		mark_buffer_dirty
        decrementing (5):
                clear_page_dirty_for_io
                cancel_dirty_page
		delete_from_page_cache
		__delete_from_page_cache
		replace_page_cache_page

The lock order between memcg lock and mapping lock is:
	--> memcg->move_lock
	  --> mapping->private_lock
            --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
cc: Michal Hocko <mhocko@suse.cz>
cc: Greg Thelen <gthelen@google.com>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>
---
 fs/buffer.c                |    9 +++++++++
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   14 ++++++++++++++
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   24 ++++++++++++++++++++++--
 mm/truncate.c              |    6 ++++++
 6 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 695eb14..7c537f4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -694,10 +694,13 @@ int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -713,6 +716,7 @@ int __set_page_dirty_buffers(struct page *page)
 
 	if (newly_dirty)
 		__set_page_dirty(page, mapping, 1);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1169,11 +1173,16 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		bool locked;
+		unsigned long flags;
+
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (!TestSetPageDirty(page)) {
 			struct address_space *mapping = page_mapping(page);
 			if (mapping)
 				__set_page_dirty(page, mapping, 0);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d166aeb..f952be6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b51ac1..5642de6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -65,6 +65,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *  ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *						move_lock_mem_cgroup)
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *        ->mapping->tree_lock
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -144,6 +149,7 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
@@ -161,13 +167,17 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	bool locked;
+	unsigned long flags;
 
 	BUG_ON(!PageLocked(page));
 
 	freepage = mapping->a_ops->freepage;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
@@ -417,6 +427,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		bool locked;
+		unsigned long flags;
 
 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
@@ -425,6 +437,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
+		mem_cgroup_begin_update_page_stat(old, &locked, &flags);
 		spin_lock_irq(&mapping->tree_lock);
 		__delete_from_page_cache(old);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
@@ -434,6 +447,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(old, &locked, &flags);
 		/* mem_cgroup codes must not be called under tree_lock */
 		mem_cgroup_replace_page_cache(old, new);
 		radix_tree_preload_end();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f9acf49..1d31851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss_huge",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -3743,6 +3744,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3788,13 +3803,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (!anon && PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..3900e62 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty_{nobuffers,buffers} for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
@@ -1989,6 +1994,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 	trace_writeback_dirty_page(page, mapping);
 
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2028,6 +2034,11 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
@@ -2045,12 +2056,15 @@ int __set_page_dirty_nobuffers(struct page *page)
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2166,6 +2180,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2207,13 +2224,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..9c9aa03 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 4/6] memcg: add per cgroup writeback pages accounting
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
                   ` (2 preceding siblings ...)
       [not found] ` <1373044710-27371-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2013-07-05 17:32 ` Sha Zhengju
  2013-07-11 14:40     ` Michal Hocko
  2013-07-05 17:33 ` [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists Sha Zhengju
  2013-07-05 17:34 ` [PATCH V4 6/6] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
  5 siblings, 1 reply; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:32 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty pages, we add per cgroup writeback page accounting. The locking
rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_{clear/set}_page_writeback().

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Greg Thelen <gthelen@google.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   15 +++++++++++++++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f952be6..ccd35d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
 	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d31851..9126abc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -92,6 +92,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"mapped_file",
 	"swap",
 	"dirty",
+	"writeback",
 };
 
 enum mem_cgroup_events_index {
@@ -3811,6 +3812,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 			MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3900e62..85de9a0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting writeback pages.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
                   ` (3 preceding siblings ...)
  2013-07-05 17:32 ` [PATCH V4 4/6] memcg: add per cgroup writeback " Sha Zhengju
@ 2013-07-05 17:33 ` Sha Zhengju
  2013-07-11 14:56   ` Michal Hocko
  2013-07-05 17:34 ` [PATCH V4 6/6] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
  5 siblings, 1 reply; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:33 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

If memcg is enabled and no non-root memcg exists, all allocated pages
belong to root_mem_cgroup and will go through the root memcg statistics routines.
So in order to reduce the overhead after adding memcg dirty/writeback accounting
in hot paths, we use a jump label to patch mem_cgroup_{begin,end}_update_page_stat()
in or out when not used. If no non-root memcg comes to life, we do not need to
acquire the moving locks, so patch them out.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
cc: Michal Hocko <mhocko@suse.cz>
cc: Greg Thelen <gthelen@google.com>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/memcontrol.h |   15 +++++++++++++++
 mm/memcontrol.c            |   23 ++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ccd35d8..0483e1a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -55,6 +55,13 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+
+extern struct static_key memcg_inuse_key;
+static inline bool mem_cgroup_in_use(void)
+{
+	return static_key_false(&memcg_inuse_key);
+}
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -159,6 +166,8 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 {
 	if (mem_cgroup_disabled())
 		return;
+	if (!mem_cgroup_in_use())
+		return;
 	rcu_read_lock();
 	*locked = false;
 	if (atomic_read(&memcg_moving))
@@ -172,6 +181,8 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 {
 	if (mem_cgroup_disabled())
 		return;
+	if (!mem_cgroup_in_use())
+		return;
 	if (*locked)
 		__mem_cgroup_end_update_page_stat(page, flags);
 	rcu_read_unlock();
@@ -215,6 +226,10 @@ void mem_cgroup_print_bad_page(struct page *page);
 #endif
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
+static inline bool mem_cgroup_in_use(void)
+{
+	return false;
+}
 
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9126abc..a85f7c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,13 @@ enum res_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
+/* static_key used for marking memcg in use or not. We use this jump label to
+ * patch some memcg page stat accounting code in or out.
+ * The key will be increased when non-root memcg is created, and be decreased
+ * when memcg is destroyed.
+ */
+struct static_key memcg_inuse_key;
+
 /*
  * The memcg_create_mutex will be held whenever a new cgroup is created.
  * As a consequence, any change that needs to protect against new child cgroups
@@ -630,10 +637,22 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void disarm_inuse_keys(struct mem_cgroup *memcg)
+{
+	if (!mem_cgroup_is_root(memcg))
+		static_key_slow_dec(&memcg_inuse_key);
+}
+
+static void arm_inuse_keys(void)
+{
+	static_key_slow_inc(&memcg_inuse_key);
+}
+
 static void disarm_static_keys(struct mem_cgroup *memcg)
 {
 	disarm_sock_keys(memcg);
 	disarm_kmem_keys(memcg);
+	disarm_inuse_keys(memcg);
 }
 
 static void drain_all_stock_async(struct mem_cgroup *memcg);
@@ -2298,7 +2317,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
@@ -6293,6 +6311,9 @@ mem_cgroup_css_online(struct cgroup *cont)
 	}
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+	if (!error)
+		arm_inuse_keys();
+
 	mutex_unlock(&memcg_create_mutex);
 	return error;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 6/6] memcg: Document cgroup dirty/writeback memory statistics
  2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
                   ` (4 preceding siblings ...)
  2013-07-05 17:33 ` [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists Sha Zhengju
@ 2013-07-05 17:34 ` Sha Zhengju
  5 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-05 17:34 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: mhocko, gthelen, kamezawa.hiroyu, akpm, fengguang.wu, mgorman,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
cc: Greg Thelen <gthelen@google.com>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Fengguang Wu <fengguang.wu@intel.com>
cc: Mel Gorman <mgorman@suse.de>
---
 Documentation/cgroups/memory.txt |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 327acec..7ed4fa9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -490,6 +490,8 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+dirty          - # of bytes of file cache that are not in sync with the disk copy.
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-07-05 17:21 ` [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
@ 2013-07-11 13:39   ` Michal Hocko
  2013-07-11 15:53     ` Sha Zhengju
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2013-07-11 13:39 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, kamezawa.hiroyu, akpm, fengguang.wu,
	mgorman, Sha Zhengju

I think this one can go in now, even without the rest of the series.
It is a good cleanup.

On Sat 06-07-13 01:21:42, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED
> as an extra layer of indirection because of the complexity and presumed
> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Fengguang Wu <fengguang.wu@intel.com>
> Reviewed-by: Greg Thelen <gthelen@google.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/memcontrol.h |   27 +++++++++++++++++++--------
>  mm/memcontrol.c            |   25 +------------------------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7b4d9d7..d166aeb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -30,9 +30,20 @@ struct page;
>  struct mm_struct;
>  struct kmem_cache;
>  
> -/* Stats that can be updated by kernel. */
> -enum mem_cgroup_page_stat_item {
> -	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +/*
> + * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
> + * These two lists should keep in accord with each other.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
> +	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
> +	MEM_CGROUP_STAT_NSTATS,
>  };
>  
>  struct mem_cgroup_reclaim_cookie {
> @@ -165,17 +176,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  }
>  
>  void mem_cgroup_update_page_stat(struct page *page,
> -				 enum mem_cgroup_page_stat_item idx,
> +				 enum mem_cgroup_stat_index idx,
>  				 int val);
>  
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  	mem_cgroup_update_page_stat(page, idx, 1);
>  }
>  
>  static inline void mem_cgroup_dec_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  	mem_cgroup_update_page_stat(page, idx, -1);
>  }
> @@ -349,12 +360,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  }
>  
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  }
>  
>  static inline void mem_cgroup_dec_page_stat(struct page *page,
> -					    enum mem_cgroup_page_stat_item idx)
> +					    enum mem_cgroup_stat_index idx)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6e120e4..f9acf49 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,21 +85,6 @@ static int really_do_swap_account __initdata = 0;
>  #endif
>  
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
> -	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  static const char * const mem_cgroup_stat_names[] = {
>  	"cache",
>  	"rss",
> @@ -2307,7 +2292,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>  }
>  
>  void mem_cgroup_update_page_stat(struct page *page,
> -				 enum mem_cgroup_page_stat_item idx, int val)
> +				 enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -2320,14 +2305,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>  		return;
>  
> -	switch (idx) {
> -	case MEMCG_NR_FILE_MAPPED:
> -		idx = MEM_CGROUP_STAT_FILE_MAPPED;
> -		break;
> -	default:
> -		BUG();
> -	}
> -
>  	this_cpu_add(memcg->stat->count[idx], val);
>  }
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index cd356df..3a3e03e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
>  	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>  	}
>  	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
> @@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
>  					      NR_ANON_TRANSPARENT_HUGEPAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>  		mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	}
>  	if (unlikely(PageMlocked(page)))
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 3/6] memcg: add per cgroup dirty pages accounting
  2013-07-05 17:30     ` Sha Zhengju
  (?)
@ 2013-07-11 14:31     ` Michal Hocko
  2013-07-11 16:49       ` Sha Zhengju
  -1 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2013-07-11 14:31 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, gthelen, kamezawa.hiroyu, akpm,
	fengguang.wu, mgorman, Sha Zhengju

Please also CC vfs people

On Sat 06-07-13 01:30:09, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory.
> 
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:

Well, bigger lock is a bit of an overstatement ;). It is a full no-op for
!CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an rcu
read lock in most cases (no task is moving) and a spin_lock_irqsave on
top in the slow path.

It would be good to mention this in the changelog for those who are not
familiar.
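
For reference, the begin side currently boils down to this (trimmed from
include/linux/memcontrol.h, so take it as a rough sketch):

	static inline void mem_cgroup_begin_update_page_stat(struct page *page,
					bool *locked, unsigned long *flags)
	{
		if (mem_cgroup_disabled())
			return;			/* compiled in but not used */
		rcu_read_lock();		/* common case: only this */
		*locked = false;
		if (atomic_read(&memcg_moving))	/* a move is in flight */
			__mem_cgroup_begin_update_page_stat(page, locked, flags);
	}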

> 
>          mem_cgroup_begin_update_page_stat()
>          modify page information        -->(a)
>          mem_cgroup_update_page_stat()  -->(b)

Hmm, mem_cgroup_update_page_stat doesn't do any checking that we use a
proper locking. Which would be hard but we could at least test for
rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled().
This would be a nice preparatory patch. What do you think?
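
Something along these lines (only a sketch, the exact wording and placement
are up to you):

	void mem_cgroup_update_page_stat(struct page *page,
					 enum mem_cgroup_stat_index idx, int val)
	{
		...
		if (mem_cgroup_disabled())
			return;
		/*
		 * Callers must be inside a mem_cgroup_begin/end_update_page_stat()
		 * section, which takes rcu_read_lock() whenever memcg is enabled.
		 */
		VM_BUG_ON(!rcu_read_lock_held());
		...
	}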

>          mem_cgroup_end_update_page_stat()
> It requires both (a) and (b)(dirty pages accounting) to be pretected in
> mem_cgroup_{begin/end}_update_page_stat().
> 
> Server places should be added accounting:

Server?

>         incrementing (3):
>                 __set_page_dirty_buffers
>                 __set_page_dirty_nobuffers
> 		mark_buffer_dirty
>         decrementing (5):
>                 clear_page_dirty_for_io
>                 cancel_dirty_page
> 		delete_from_page_cache
> 		__delete_from_page_cache
> 		replace_page_cache_page
> 
> The lock order between memcg lock and mapping lock is:
> 	--> memcg->move_lock
> 	  --> mapping->private_lock
>             --> mapping->tree_lock
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> cc: Michal Hocko <mhocko@suse.cz>
> cc: Greg Thelen <gthelen@google.com>
> cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Fengguang Wu <fengguang.wu@intel.com>
> cc: Mel Gorman <mgorman@suse.de>
> ---
>  fs/buffer.c                |    9 +++++++++
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |   14 ++++++++++++++
>  mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
>  mm/page-writeback.c        |   24 ++++++++++++++++++++++--
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 75 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 695eb14..7c537f4 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -694,10 +694,13 @@ int __set_page_dirty_buffers(struct page *page)
>  {
>  	int newly_dirty;
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
>  
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);

I guess it would be worth mentioning why we do not care about pages
without mapping.
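
E.g. a short comment like (sketch):

	/*
	 * Pages without a mapping are never accounted in NR_FILE_DIRTY
	 * (account_page_dirtied() is skipped for them), so there is nothing
	 * for memcg to account here either.
	 */
	if (unlikely(!mapping))
		return !TestSetPageDirty(page);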

> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	spin_lock(&mapping->private_lock);
>  	if (page_has_buffers(page)) {
>  		struct buffer_head *head = page_buffers(page);
[...]
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4b51ac1..5642de6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
[...]
> @@ -144,6 +149,7 @@ void __delete_from_page_cache(struct page *page)

This needs a comment that it has to be called from within
mem_cgroup_{begin,end}_update_page_stat context. Btw. it seems that you
are missing invalidate_complete_page2 and __remove_mapping
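
Something like this above __delete_from_page_cache() would do (sketch):

	/*
	 * Must be called inside a mem_cgroup_begin_update_page_stat() /
	 * mem_cgroup_end_update_page_stat() section (in addition to
	 * mapping->tree_lock), otherwise the memcg dirty accounting below
	 * can race with moving the page to another memcg.
	 */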

>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f9acf49..1d31851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss_huge",
>  	"mapped_file",
>  	"swap",
> +	"dirty",

This doesn't match mem_cgroup_stat_index ordering.
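
IOW the array has to stay in the same order as the enum, i.e. (sketch):

	static const char * const mem_cgroup_stat_names[] = {
		"cache",
		"rss",
		"rss_huge",
		"mapped_file",
		"dirty",	/* MEM_CGROUP_STAT_FILE_DIRTY */
		"swap",
	};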

>  };
>  
>  enum mem_cgroup_events_index {
[...]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..3900e62 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
>  
>  /*
>   * Helper function for set_page_dirty family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting dirty pages.

I think "while calling this function" would be sufficient.

> + * See __set_page_dirty_{nobuffers,buffers} for example.
> + *
>   * NOTE: This relies on being atomic wrt interrupts.
>   */
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
[...]

Thanks
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 4/6] memcg: add per cgroup writeback pages accounting
  2013-07-05 17:32 ` [PATCH V4 4/6] memcg: add per cgroup writeback " Sha Zhengju
@ 2013-07-11 14:40     ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2013-07-11 14:40 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, gthelen, kamezawa.hiroyu, akpm,
	fengguang.wu, mgorman, Sha Zhengju

On Sat 06-07-13 01:32:57, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
> 
> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> cc: Greg Thelen <gthelen@google.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Fengguang Wu <fengguang.wu@intel.com>
> cc: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   15 +++++++++++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f952be6..ccd35d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1d31851..9126abc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -92,6 +92,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"mapped_file",
>  	"swap",
>  	"dirty",
> +	"writeback",

Ordering issue again (see mem_cgroup_stat_index)

>  };
>  
>  enum mem_cgroup_events_index {
[...]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 3900e62..85de9a0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while modifying struct page state and accounting writeback pages.

I guess "while calling this function" would be sufficient

> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread


* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-05 17:33 ` [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists Sha Zhengju
@ 2013-07-11 14:56   ` Michal Hocko
  2013-07-12 12:59     ` Sha Zhengju
  2013-07-15 17:58     ` Greg Thelen
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2013-07-11 14:56 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, gthelen, kamezawa.hiroyu, akpm, fengguang.wu,
	mgorman, Sha Zhengju

On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> If memcg is enabled and no non-root memcg exists, all allocated
> pages belongs to root_mem_cgroup and wil go through root memcg
> statistics routines.  So in order to reduce overheads after adding
> memcg dirty/writeback accounting in hot paths, we use jump label to
> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
> used.

I do not think this is enough. How much do you save? One atomic read.
This doesn't seem like a killer.

I hoped we could simply not account at all and move counters to the root
cgroup once the label gets enabled.

Besides that, the current patch is racy. Consider what happens when:

mem_cgroup_begin_update_page_stat
					arm_inuse_keys
							mem_cgroup_move_account
mem_cgroup_move_account_page_stat
mem_cgroup_end_update_page_stat

The race window is small of course but it is there. I guess we need
rcu_read_lock at least.
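
I.e. keep the rcu part unconditional and only patch out the memcg_moving
handling, something like (sketch; the end side would keep its
rcu_read_unlock unconditional as well):

	static inline void mem_cgroup_begin_update_page_stat(struct page *page,
					bool *locked, unsigned long *flags)
	{
		if (mem_cgroup_disabled())
			return;
		rcu_read_lock();
		*locked = false;
		if (!mem_cgroup_in_use())
			return;
		if (atomic_read(&memcg_moving))
			__mem_cgroup_begin_update_page_stat(page, locked, flags);
	}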

> If no non-root memcg comes to life, we do not need to accquire moving
> locks, so patch them out.
>
> cc: Michal Hocko <mhocko@suse.cz>
> cc: Greg Thelen <gthelen@google.com>
> cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Fengguang Wu <fengguang.wu@intel.com>
> cc: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/memcontrol.h |   15 +++++++++++++++
>  mm/memcontrol.c            |   23 ++++++++++++++++++++++-
>  2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ccd35d8..0483e1a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -55,6 +55,13 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +
> +extern struct static_key memcg_inuse_key;
> +static inline bool mem_cgroup_in_use(void)
> +{
> +	return static_key_false(&memcg_inuse_key);
> +}
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -159,6 +166,8 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page,
>  {
>  	if (mem_cgroup_disabled())
>  		return;
> +	if (!mem_cgroup_in_use())
> +		return;
>  	rcu_read_lock();
>  	*locked = false;
>  	if (atomic_read(&memcg_moving))
> @@ -172,6 +181,8 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>  {
>  	if (mem_cgroup_disabled())
>  		return;
> +	if (!mem_cgroup_in_use())
> +		return;
>  	if (*locked)
>  		__mem_cgroup_end_update_page_stat(page, flags);
>  	rcu_read_unlock();
> @@ -215,6 +226,10 @@ void mem_cgroup_print_bad_page(struct page *page);
>  #endif
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
> +static inline bool mem_cgroup_in_use(void)
> +{
> +	return false;
> +}
>  
>  static inline int mem_cgroup_newpage_charge(struct page *page,
>  					struct mm_struct *mm, gfp_t gfp_mask)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9126abc..a85f7c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -463,6 +463,13 @@ enum res_type {
>  #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
>  #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
>  
> +/* static_key used for marking memcg in use or not. We use this jump label to
> + * patch some memcg page stat accounting code in or out.
> + * The key will be increased when non-root memcg is created, and be decreased
> + * when memcg is destroyed.
> + */
> +struct static_key memcg_inuse_key;
> +
>  /*
>   * The memcg_create_mutex will be held whenever a new cgroup is created.
>   * As a consequence, any change that needs to protect against new child cgroups
> @@ -630,10 +637,22 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
>  
> +static void disarm_inuse_keys(struct mem_cgroup *memcg)
> +{
> +	if (!mem_cgroup_is_root(memcg))
> +		static_key_slow_dec(&memcg_inuse_key);
> +}
> +
> +static void arm_inuse_keys(void)
> +{
> +	static_key_slow_inc(&memcg_inuse_key);
> +}
> +
>  static void disarm_static_keys(struct mem_cgroup *memcg)
>  {
>  	disarm_sock_keys(memcg);
>  	disarm_kmem_keys(memcg);
> +	disarm_inuse_keys(memcg);
>  }
>  
>  static void drain_all_stock_async(struct mem_cgroup *memcg);
> @@ -2298,7 +2317,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>  {
>  	struct mem_cgroup *memcg;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	unsigned long uninitialized_var(flags);
>  
>  	if (mem_cgroup_disabled())
>  		return;
> @@ -6293,6 +6311,9 @@ mem_cgroup_css_online(struct cgroup *cont)
>  	}
>  
>  	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +	if (!error)
> +		arm_inuse_keys();
> +
>  	mutex_unlock(&memcg_create_mutex);
>  	return error;
>  }
> -- 
> 1.7.9.5
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-07-11 13:39   ` Michal Hocko
@ 2013-07-11 15:53     ` Sha Zhengju
  0 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-11 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki, Andrew Morton,
	Wu Fengguang, Mel Gorman, Sha Zhengju

On Thu, Jul 11, 2013 at 9:39 PM, Michal Hocko <mhocko@suse.cz> wrote:
> I think this one can go in now also without the rest of the series.
> It is a good clean up

I'd be happy if you will. : )

>
> On Sat 06-07-13 01:21:42, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED
>> as an extra layer of indirection because of the complexity and presumed
>> performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Acked-by: Michal Hocko <mhocko@suse.cz>
>> Acked-by: Fengguang Wu <fengguang.wu@intel.com>
>> Reviewed-by: Greg Thelen <gthelen@google.com>
>> cc: Andrew Morton <akpm@linux-foundation.org>
>> cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  include/linux/memcontrol.h |   27 +++++++++++++++++++--------
>>  mm/memcontrol.c            |   25 +------------------------
>>  mm/rmap.c                  |    4 ++--
>>  3 files changed, 22 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 7b4d9d7..d166aeb 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -30,9 +30,20 @@ struct page;
>>  struct mm_struct;
>>  struct kmem_cache;
>>
>> -/* Stats that can be updated by kernel. */
>> -enum mem_cgroup_page_stat_item {
>> -     MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +/*
>> + * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
>> + * These two lists should keep in accord with each other.
>> + */
>> +enum mem_cgroup_stat_index {
>> +     /*
>> +      * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
>> +      */
>> +     MEM_CGROUP_STAT_CACHE,          /* # of pages charged as cache */
>> +     MEM_CGROUP_STAT_RSS,            /* # of pages charged as anon rss */
>> +     MEM_CGROUP_STAT_RSS_HUGE,       /* # of pages charged as anon huge */
>> +     MEM_CGROUP_STAT_FILE_MAPPED,    /* # of pages charged as file rss */
>> +     MEM_CGROUP_STAT_SWAP,           /* # of pages, swapped out */
>> +     MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>>  struct mem_cgroup_reclaim_cookie {
>> @@ -165,17 +176,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>>  }
>>
>>  void mem_cgroup_update_page_stat(struct page *page,
>> -                              enum mem_cgroup_page_stat_item idx,
>> +                              enum mem_cgroup_stat_index idx,
>>                                int val);
>>
>>  static inline void mem_cgroup_inc_page_stat(struct page *page,
>> -                                         enum mem_cgroup_page_stat_item idx)
>> +                                         enum mem_cgroup_stat_index idx)
>>  {
>>       mem_cgroup_update_page_stat(page, idx, 1);
>>  }
>>
>>  static inline void mem_cgroup_dec_page_stat(struct page *page,
>> -                                         enum mem_cgroup_page_stat_item idx)
>> +                                         enum mem_cgroup_stat_index idx)
>>  {
>>       mem_cgroup_update_page_stat(page, idx, -1);
>>  }
>> @@ -349,12 +360,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>>  }
>>
>>  static inline void mem_cgroup_inc_page_stat(struct page *page,
>> -                                         enum mem_cgroup_page_stat_item idx)
>> +                                         enum mem_cgroup_stat_index idx)
>>  {
>>  }
>>
>>  static inline void mem_cgroup_dec_page_stat(struct page *page,
>> -                                         enum mem_cgroup_page_stat_item idx)
>> +                                         enum mem_cgroup_stat_index idx)
>>  {
>>  }
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6e120e4..f9acf49 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -85,21 +85,6 @@ static int really_do_swap_account __initdata = 0;
>>  #endif
>>
>>
>> -/*
>> - * Statistics for memory cgroup.
>> - */
>> -enum mem_cgroup_stat_index {
>> -     /*
>> -      * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
>> -      */
>> -     MEM_CGROUP_STAT_CACHE,          /* # of pages charged as cache */
>> -     MEM_CGROUP_STAT_RSS,            /* # of pages charged as anon rss */
>> -     MEM_CGROUP_STAT_RSS_HUGE,       /* # of pages charged as anon huge */
>> -     MEM_CGROUP_STAT_FILE_MAPPED,    /* # of pages charged as file rss */
>> -     MEM_CGROUP_STAT_SWAP,           /* # of pages, swapped out */
>> -     MEM_CGROUP_STAT_NSTATS,
>> -};
>> -
>>  static const char * const mem_cgroup_stat_names[] = {
>>       "cache",
>>       "rss",
>> @@ -2307,7 +2292,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>>  }
>>
>>  void mem_cgroup_update_page_stat(struct page *page,
>> -                              enum mem_cgroup_page_stat_item idx, int val)
>> +                              enum mem_cgroup_stat_index idx, int val)
>>  {
>>       struct mem_cgroup *memcg;
>>       struct page_cgroup *pc = lookup_page_cgroup(page);
>> @@ -2320,14 +2305,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>>       if (unlikely(!memcg || !PageCgroupUsed(pc)))
>>               return;
>>
>> -     switch (idx) {
>> -     case MEMCG_NR_FILE_MAPPED:
>> -             idx = MEM_CGROUP_STAT_FILE_MAPPED;
>> -             break;
>> -     default:
>> -             BUG();
>> -     }
>> -
>>       this_cpu_add(memcg->stat->count[idx], val);
>>  }
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index cd356df..3a3e03e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
>>       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       if (atomic_inc_and_test(&page->_mapcount)) {
>>               __inc_zone_page_state(page, NR_FILE_MAPPED);
>> -             mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> +             mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>>       }
>>       mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>  }
>> @@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
>>                                             NR_ANON_TRANSPARENT_HUGEPAGES);
>>       } else {
>>               __dec_zone_page_state(page, NR_FILE_MAPPED);
>> -             mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
>>               mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       }
>>       if (unlikely(PageMlocked(page)))
>> --
>> 1.7.9.5
>>
>
> --
> Michal Hocko
> SUSE Labs



--
Thanks,
Sha


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 3/6] memcg: add per cgroup dirty pages accounting
  2013-07-11 14:31     ` Michal Hocko
@ 2013-07-11 16:49       ` Sha Zhengju
  0 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-11 16:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-fsdevel, linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki,
	Andrew Morton, Wu Fengguang, Mel Gorman, Sha Zhengju

On Thu, Jul 11, 2013 at 10:31 PM, Michal Hocko <mhocko@suse.cz> wrote:
> Please also CC vfs people
>
> On Sat 06-07-13 01:30:09, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory.
>>
>> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
>> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
>> has a feature to move a page from a cgroup to another one and may have race between
>> "move" and "page stat accounting". So in order to avoid the race we have designed a
>> bigger lock:
>
> Well, bigger lock is a bit of an overstatement ;). It is a full no-op for
> !CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an rcu
> read lock in most cases (no task is moving) and a spin_lock_irqsave on
> top in the slow path.
>
> It would be good to mention this in the changelog for those who are not
> familiar.

Okay, good advice.

>
>>
>>          mem_cgroup_begin_update_page_stat()
>>          modify page information        -->(a)
>>          mem_cgroup_update_page_stat()  -->(b)
>
> Hmm, mem_cgroup_update_page_stat doesn't do any checking that we use a
> proper locking. Which would be hard but we could at least test for
> rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled().
> This would be a nice preparatory patch. What do you think?

In patch 5/6 I've used a static branch to patch out most of
mem_cgroup_begin/end_update_page_stat(), including the
rcu_read_lock/unlock, so the rcu_read_lock_held() test could fail in
that case. Well, since I still need to rethink that part based on your
comments, it's no problem to add the check patch in the next round.

>
>>          mem_cgroup_end_update_page_stat()
>> It requires both (a) and (b)(dirty pages accounting) to be pretected in
>> mem_cgroup_{begin/end}_update_page_stat().
>>
>> Server places should be added accounting:
>
> Server?

Oops.. I mean 'several'.....

>
>>         incrementing (3):
>>                 __set_page_dirty_buffers
>>                 __set_page_dirty_nobuffers
>>               mark_buffer_dirty
>>         decrementing (5):
>>                 clear_page_dirty_for_io
>>                 cancel_dirty_page
>>               delete_from_page_cache
>>               __delete_from_page_cache
>>               replace_page_cache_page
>>
>> The lock order between memcg lock and mapping lock is:
>>       --> memcg->move_lock
>>         --> mapping->private_lock
>>             --> mapping->tree_lock
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> cc: Michal Hocko <mhocko@suse.cz>
>> cc: Greg Thelen <gthelen@google.com>
>> cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> cc: Andrew Morton <akpm@linux-foundation.org>
>> cc: Fengguang Wu <fengguang.wu@intel.com>
>> cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  fs/buffer.c                |    9 +++++++++
>>  include/linux/memcontrol.h |    1 +
>>  mm/filemap.c               |   14 ++++++++++++++
>>  mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
>>  mm/page-writeback.c        |   24 ++++++++++++++++++++++--
>>  mm/truncate.c              |    6 ++++++
>>  6 files changed, 75 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 695eb14..7c537f4 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -694,10 +694,13 @@ int __set_page_dirty_buffers(struct page *page)
>>  {
>>       int newly_dirty;
>>       struct address_space *mapping = page_mapping(page);
>> +     bool locked;
>> +     unsigned long flags;
>>
>>       if (unlikely(!mapping))
>>               return !TestSetPageDirty(page);
>
> I guess it would be worth mentioning why we do not care about pages
> without mapping.

Actually what I'm concerned about are the places that do both
'TestSetPageDirty' and 'account_pages_dirtied'. Since no global
counting is done here, I needn't do the memcg counting either.

>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       spin_lock(&mapping->private_lock);
>>       if (page_has_buffers(page)) {
>>               struct buffer_head *head = page_buffers(page);
> [...]
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 4b51ac1..5642de6 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
> [...]
>> @@ -144,6 +149,7 @@ void __delete_from_page_cache(struct page *page)
>
> This needs a comment that it has to be called from within
> mem_cgroup_{begin,end}_update_page_stat context. Btw. it seems that you
> are missing invalidate_complete_page2 and __remove_mapping

Sorry, yes, I'll add them next.

>
>>        * having removed the page entirely.
>>        */
>>       if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>               dec_zone_page_state(page, NR_FILE_DIRTY);
>>               dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>       }
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f9acf49..1d31851 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "rss_huge",
>>       "mapped_file",
>>       "swap",
>> +     "dirty",
>
> This doesn't match mem_cgroup_stat_index ordering.

Yes... I should be more careful. :(

>
>>  };
>>
>>  enum mem_cgroup_events_index {
> [...]
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 4514ad7..3900e62 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
>>
>>  /*
>>   * Helper function for set_page_dirty family.
>> + *
>> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
>> + * while modifying struct page state and accounting dirty pages.
>
> I think "while calling this function" would be sufficient.

Okay.
Thanks for reviewing!

>
>> + * See __set_page_dirty_{nobuffers,buffers} for example.
>> + *
>>   * NOTE: This relies on being atomic wrt interrupts.
>>   */
>>  void account_page_dirtied(struct page *page, struct address_space *mapping)
> [...]
>
> Thanks
> --
> Michal Hocko
> SUSE Labs



--
Thanks,
Sha


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 4/6] memcg: add per cgroup writeback pages accounting
@ 2013-07-11 16:56         ` Sha Zhengju
  0 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-11 16:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-fsdevel, linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki,
	Andrew Morton, Wu Fengguang, Mel Gorman, Sha Zhengju

On Thu, Jul 11, 2013 at 10:40 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Sat 06-07-13 01:32:57, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
>> rule still is:
>>         mem_cgroup_begin_update_page_stat()
>>         modify page WRITEBACK stat
>>         mem_cgroup_update_page_stat()
>>         mem_cgroup_end_update_page_stat()
>>
>> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Acked-by: Michal Hocko <mhocko@suse.cz>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> cc: Greg Thelen <gthelen@google.com>
>> cc: Andrew Morton <akpm@linux-foundation.org>
>> cc: Fengguang Wu <fengguang.wu@intel.com>
>> cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  include/linux/memcontrol.h |    1 +
>>  mm/memcontrol.c            |    5 +++++
>>  mm/page-writeback.c        |   15 +++++++++++++++
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index f952be6..ccd35d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_RSS_HUGE,       /* # of pages charged as anon huge */
>>       MEM_CGROUP_STAT_FILE_MAPPED,    /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_FILE_DIRTY,     /* # of dirty pages in page cache */
>> +     MEM_CGROUP_STAT_WRITEBACK,      /* # of pages under writeback */
>>       MEM_CGROUP_STAT_SWAP,           /* # of pages, swapped out */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 1d31851..9126abc 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -92,6 +92,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "mapped_file",
>>       "swap",
>>       "dirty",
>> +     "writeback",
>
> Ordering issue again (see mem_cgroup_stat_index)

Yes, you're right.

>
>>  };
>>
>>  enum mem_cgroup_events_index {
> [...]
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 3900e62..85de9a0 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>>   * Helper function for set_page_writeback family.
>> + *
>> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
>> + * while modifying struct page state and accounting writeback pages.
>
> I guess "while calling this function" would be sufficient

Thanks for the advice.

>
>> + * See test_set_page_writeback for example.
>> + *
>>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>>   * wrt interrupts.
>>   */
>>  void account_page_writeback(struct page *page)
>>  {
>> +     mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>       inc_zone_page_state(page, NR_WRITEBACK);
>>  }
>>  EXPORT_SYMBOL(account_page_writeback);
> [...]
> --
> Michal Hocko
> SUSE Labs



--
Thanks,
Sha


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-11 14:56   ` Michal Hocko
@ 2013-07-12 12:59     ` Sha Zhengju
  2013-07-12 13:13       ` Sha Zhengju
  2013-07-12 13:25         ` Michal Hocko
  2013-07-15 17:58     ` Greg Thelen
  1 sibling, 2 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-12 12:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki, Andrew Morton,
	Wu Fengguang, Mel Gorman, Glauber Costa, Sha Zhengju


Add cc to Glauber

On Thu, Jul 11, 2013 at 10:56 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> If memcg is enabled and no non-root memcg exists, all allocated
>> pages belongs to root_mem_cgroup and wil go through root memcg
>> statistics routines.  So in order to reduce overheads after adding
>> memcg dirty/writeback accounting in hot paths, we use jump label to
>> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
>> used.
>
> I do not think this is enough. How much do you save? One atomic read.
> This doesn't seem like a killer.
>
> I hoped we could simply not account at all and move counters to the root
> cgroup once the label gets enabled.

I have thought of this approach before, but it would probably run into
another issue, e.g, each zone has a percpu stock named ->pageset to
optimize the increment and decrement operations, and I haven't figure out a
simpler and cheaper approach to handle that stock numbers if moving global
counters to root cgroup, maybe we can just leave them and can afford the
approximation?

Glauber have already done lots of works here, in his previous patchset he
also tried to move some global stats to root (
http://comments.gmane.org/gmane.linux.kernel.cgroups/6291). May I steal
some of your ideas here, Glauber? :P


>
> Besides that, the current patch is racy. Consider what happens when:
>
> mem_cgroup_begin_update_page_stat
>                                         arm_inuse_keys
>                                                               mem_cgroup_move_account
> mem_cgroup_move_account_page_stat
> mem_cgroup_end_update_page_stat
>
> The race window is small of course but it is there. I guess we need
> rcu_read_lock at least.

Yes, you're right. I'm afraid we need to take care of the race in the next
updates as well. But mem_cgroup_begin/end_update_page_stat() already take the
rcu lock, so maybe here we only need a synchronize_rcu() after changing
memcg_inuse_key?
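
I mean something like the following (just a sketch, and it only closes the
window if the rcu_read_lock/unlock in begin/end_update_page_stat() is not
patched out):

	static void arm_inuse_keys(void)
	{
		static_key_slow_inc(&memcg_inuse_key);
		/*
		 * Wait for every begin/end_update_page_stat() section that
		 * might still have seen the key disabled, before any task
		 * can be moved into the new memcg.
		 */
		synchronize_rcu();
	}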


>
>> If no non-root memcg comes to life, we do not need to accquire moving
>> locks, so patch them out.
>>
>> cc: Michal Hocko <mhocko@suse.cz>
>> cc: Greg Thelen <gthelen@google.com>
>> cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> cc: Andrew Morton <akpm@linux-foundation.org>
>> cc: Fengguang Wu <fengguang.wu@intel.com>
>> cc: Mel Gorman <mgorman@suse.de>
>> ---
>>  include/linux/memcontrol.h |   15 +++++++++++++++
>>  mm/memcontrol.c            |   23 ++++++++++++++++++++++-
>>  2 files changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index ccd35d8..0483e1a 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -55,6 +55,13 @@ struct mem_cgroup_reclaim_cookie {
>>  };
>>
>>  #ifdef CONFIG_MEMCG
>> +
>> +extern struct static_key memcg_inuse_key;
>> +static inline bool mem_cgroup_in_use(void)
>> +{
>> +     return static_key_false(&memcg_inuse_key);
>> +}
>> +
>>  /*
>>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg
doesn't
>> @@ -159,6 +166,8 @@ static inline void
mem_cgroup_begin_update_page_stat(struct page *page,
>>  {
>>       if (mem_cgroup_disabled())
>>               return;
>> +     if (!mem_cgroup_in_use())
>> +             return;
>>       rcu_read_lock();
>>       *locked = false;
>>       if (atomic_read(&memcg_moving))
>> @@ -172,6 +181,8 @@ static inline void
mem_cgroup_end_update_page_stat(struct page *page,
>>  {
>>       if (mem_cgroup_disabled())
>>               return;
>> +     if (!mem_cgroup_in_use())
>> +             return;
>>       if (*locked)
>>               __mem_cgroup_end_update_page_stat(page, flags);
>>       rcu_read_unlock();
>> @@ -215,6 +226,10 @@ void mem_cgroup_print_bad_page(struct page *page);
>>  #endif
>>  #else /* CONFIG_MEMCG */
>>  struct mem_cgroup;
>> +static inline bool mem_cgroup_in_use(void)
>> +{
>> +     return false;
>> +}
>>
>>  static inline int mem_cgroup_newpage_charge(struct page *page,
>>                                       struct mm_struct *mm, gfp_t
gfp_mask)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9126abc..a85f7c5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -463,6 +463,13 @@ enum res_type {
>>  #define MEM_CGROUP_RECLAIM_SHRINK_BIT        0x1
>>  #define MEM_CGROUP_RECLAIM_SHRINK    (1 <<
MEM_CGROUP_RECLAIM_SHRINK_BIT)
>>
>> +/* static_key used for marking memcg in use or not. We use this jump
label to
>> + * patch some memcg page stat accounting code in or out.
>> + * The key will be increased when non-root memcg is created, and be
decreased
>> + * when memcg is destroyed.
>> + */
>> +struct static_key memcg_inuse_key;
>> +
>>  /*
>>   * The memcg_create_mutex will be held whenever a new cgroup is created.
>>   * As a consequence, any change that needs to protect against new child
cgroups
>> @@ -630,10 +637,22 @@ static void disarm_kmem_keys(struct mem_cgroup
*memcg)
>>  }
>>  #endif /* CONFIG_MEMCG_KMEM */
>>
>> +static void disarm_inuse_keys(struct mem_cgroup *memcg)
>> +{
>> +     if (!mem_cgroup_is_root(memcg))
>> +             static_key_slow_dec(&memcg_inuse_key);
>> +}
>> +
>> +static void arm_inuse_keys(void)
>> +{
>> +     static_key_slow_inc(&memcg_inuse_key);
>> +}
>> +
>>  static void disarm_static_keys(struct mem_cgroup *memcg)
>>  {
>>       disarm_sock_keys(memcg);
>>       disarm_kmem_keys(memcg);
>> +     disarm_inuse_keys(memcg);
>>  }
>>
>>  static void drain_all_stock_async(struct mem_cgroup *memcg);
>> @@ -2298,7 +2317,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  {
>>       struct mem_cgroup *memcg;
>>       struct page_cgroup *pc = lookup_page_cgroup(page);
>> -     unsigned long uninitialized_var(flags);
>>
>>       if (mem_cgroup_disabled())
>>               return;
>> @@ -6293,6 +6311,9 @@ mem_cgroup_css_online(struct cgroup *cont)
>>       }
>>
>>       error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>> +     if (!error)
>> +             arm_inuse_keys();
>> +
>>       mutex_unlock(&memcg_create_mutex);
>>       return error;
>>  }
>> --
>> 1.7.9.5
>>
>
> --
> Michal Hocko
> SUSE Labs



--
Thanks,
Sha


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-12 12:59     ` Sha Zhengju
@ 2013-07-12 13:13       ` Sha Zhengju
  2013-07-15  6:32           ` Glauber Costa
  2013-07-12 13:25         ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Sha Zhengju @ 2013-07-12 13:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki, Andrew Morton,
	Wu Fengguang, Mel Gorman, glommer, Sha Zhengju


Ooops.... that address seems to be unreachable, changing Glauber's email...


On Fri, Jul 12, 2013 at 8:59 PM, Sha Zhengju <handai.szj@gmail.com> wrote:

> Add cc to Glauber
>
>
> On Thu, Jul 11, 2013 at 10:56 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> If memcg is enabled and no non-root memcg exists, all allocated
> >> pages belongs to root_mem_cgroup and wil go through root memcg
> >> statistics routines.  So in order to reduce overheads after adding
> >> memcg dirty/writeback accounting in hot paths, we use jump label to
> >> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
> >> used.
> >
> > I do not think this is enough. How much do you save? One atomic read.
> > This doesn't seem like a killer.
> >
> > I hoped we could simply not account at all and move counters to the root
> > cgroup once the label gets enabled.
>
> I have thought of this approach before, but it would probably run into
> another issue, e.g, each zone has a percpu stock named ->pageset to
> optimize the increment and decrement operations, and I haven't figure out a
> simpler and cheaper approach to handle that stock numbers if moving global
> counters to root cgroup, maybe we can just leave them and can afford the
> approximation?
>
> Glauber have already done lots of works here, in his previous patchset he
> also tried to move some global stats to root (
> http://comments.gmane.org/gmane.linux.kernel.cgroups/6291). May I steal
> some of your ideas here, Glauber? :P
>
>
>
> >
> > Besides that, the current patch is racy. Consider what happens when:
> >
> > mem_cgroup_begin_update_page_stat
> >                                         arm_inuse_keys
> >                                                             mem_cgroup_move_account
> > mem_cgroup_move_account_page_stat
> > mem_cgroup_end_update_page_stat
> >
> > The race window is small of course but it is there. I guess we need
> > rcu_read_lock at least.
>
> Yes, you're right. I'm afraid we need to take care of the race in the next
> updates as well. But mem_cgroup_begin/end_update_page_stat() already take the
> rcu lock, so maybe here we only need a synchronize_rcu() after changing
> memcg_inuse_key?
>
>
> >
> >> If no non-root memcg comes to life, we do not need to accquire moving
> >> locks, so patch them out.
> >>
> >> cc: Michal Hocko <mhocko@suse.cz>
> >> cc: Greg Thelen <gthelen@google.com>
> >> cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> cc: Andrew Morton <akpm@linux-foundation.org>
> >> cc: Fengguang Wu <fengguang.wu@intel.com>
> >> cc: Mel Gorman <mgorman@suse.de>
> >> ---
> >>  include/linux/memcontrol.h |   15 +++++++++++++++
> >>  mm/memcontrol.c            |   23 ++++++++++++++++++++++-
> >>  2 files changed, 37 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index ccd35d8..0483e1a 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -55,6 +55,13 @@ struct mem_cgroup_reclaim_cookie {
> >>  };
> >>
> >>  #ifdef CONFIG_MEMCG
> >> +
> >> +extern struct static_key memcg_inuse_key;
> >> +static inline bool mem_cgroup_in_use(void)
> >> +{
> >> +     return static_key_false(&memcg_inuse_key);
> >> +}
> >> +
> >>  /*
> >>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> >>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg
> doesn't
> >> @@ -159,6 +166,8 @@ static inline void
> mem_cgroup_begin_update_page_stat(struct page *page,
> >>  {
> >>       if (mem_cgroup_disabled())
> >>               return;
> >> +     if (!mem_cgroup_in_use())
> >> +             return;
> >>       rcu_read_lock();
> >>       *locked = false;
> >>       if (atomic_read(&memcg_moving))
> >> @@ -172,6 +181,8 @@ static inline void
> mem_cgroup_end_update_page_stat(struct page *page,
> >>  {
> >>       if (mem_cgroup_disabled())
> >>               return;
> >> +     if (!mem_cgroup_in_use())
> >> +             return;
> >>       if (*locked)
> >>               __mem_cgroup_end_update_page_stat(page, flags);
> >>       rcu_read_unlock();
> >> @@ -215,6 +226,10 @@ void mem_cgroup_print_bad_page(struct page *page);
> >>  #endif
> >>  #else /* CONFIG_MEMCG */
> >>  struct mem_cgroup;
> >> +static inline bool mem_cgroup_in_use(void)
> >> +{
> >> +     return false;
> >> +}
> >>
> >>  static inline int mem_cgroup_newpage_charge(struct page *page,
> >>                                       struct mm_struct *mm, gfp_t
> gfp_mask)
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 9126abc..a85f7c5 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -463,6 +463,13 @@ enum res_type {
> >>  #define MEM_CGROUP_RECLAIM_SHRINK_BIT        0x1
> >>  #define MEM_CGROUP_RECLAIM_SHRINK    (1 <<
> MEM_CGROUP_RECLAIM_SHRINK_BIT)
> >>
> >> +/* static_key used for marking memcg in use or not. We use this jump
> label to
> >> + * patch some memcg page stat accounting code in or out.
> >> + * The key will be increased when non-root memcg is created, and be
> decreased
> >> + * when memcg is destroyed.
> >> + */
> >> +struct static_key memcg_inuse_key;
> >> +
> >>  /*
> >>   * The memcg_create_mutex will be held whenever a new cgroup is
> created.
> >>   * As a consequence, any change that needs to protect against new
> child cgroups
> >> @@ -630,10 +637,22 @@ static void disarm_kmem_keys(struct mem_cgroup
> *memcg)
> >>  }
> >>  #endif /* CONFIG_MEMCG_KMEM */
> >>
> >> +static void disarm_inuse_keys(struct mem_cgroup *memcg)
> >> +{
> >> +     if (!mem_cgroup_is_root(memcg))
> >> +             static_key_slow_dec(&memcg_inuse_key);
> >> +}
> >> +
> >> +static void arm_inuse_keys(void)
> >> +{
> >> +     static_key_slow_inc(&memcg_inuse_key);
> >> +}
> >> +
> >>  static void disarm_static_keys(struct mem_cgroup *memcg)
> >>  {
> >>       disarm_sock_keys(memcg);
> >>       disarm_kmem_keys(memcg);
> >> +     disarm_inuse_keys(memcg);
> >>  }
> >>
> >>  static void drain_all_stock_async(struct mem_cgroup *memcg);
> >> @@ -2298,7 +2317,6 @@ void mem_cgroup_update_page_stat(struct page
> *page,
> >>  {
> >>       struct mem_cgroup *memcg;
> >>       struct page_cgroup *pc = lookup_page_cgroup(page);
> >> -     unsigned long uninitialized_var(flags);
> >>
> >>       if (mem_cgroup_disabled())
> >>               return;
> >> @@ -6293,6 +6311,9 @@ mem_cgroup_css_online(struct cgroup *cont)
> >>       }
> >>
> >>       error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> >> +     if (!error)
> >> +             arm_inuse_keys();
> >> +
> >>       mutex_unlock(&memcg_create_mutex);
> >>       return error;
> >>  }
> >> --
> >> 1.7.9.5
> >>
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
>
>
> --
> Thanks,
> Sha
>



-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
@ 2013-07-12 13:25         ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2013-07-12 13:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki, Andrew Morton,
	Wu Fengguang, Mel Gorman, Glauber Costa, Sha Zhengju

On Fri 12-07-13 20:59:24, Sha Zhengju wrote:
> Add cc to Glauber
> 
> On Thu, Jul 11, 2013 at 10:56 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> If memcg is enabled and no non-root memcg exists, all allocated
> >> pages belong to root_mem_cgroup and will go through root memcg
> >> statistics routines.  So in order to reduce overheads after adding
> >> memcg dirty/writeback accounting in hot paths, we use jump label to
> >> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
> >> used.
> >
> > I do not think this is enough. How much do you save? One atomic read.
> > This doesn't seem like a killer.
> >
> > I hoped we could simply not account at all and move counters to the root
> > cgroup once the label gets enabled.
> 
> I have thought of this approach before, but it would probably run into
> another issue: each zone has a percpu stock named ->pageset to optimize
> the increment and decrement operations, and I haven't figured out a
> simpler and cheaper way to handle those stock numbers if we move the
> global counters to the root cgroup. Maybe we can just leave them and
> afford the approximation?

You can read per-cpu diffs during transition and tolerate small
races. Or maybe simply summing NR_FILE_DIRTY for all zones would be
sufficient.
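
A rough sketch of that direction (the seeding hook and the
MEM_CGROUP_STAT_FILE_DIRTY index are only illustrative here, not part of
the series as posted):

	static void mem_cgroup_seed_root_dirty(struct mem_cgroup *root)
	{
		struct zone *zone;
		long nr_dirty = 0;

		/*
		 * zone_page_state_snapshot() folds in the per-cpu deltas
		 * sitting in zone->pageset, so small races during the
		 * transition are tolerated rather than avoided.
		 */
		for_each_populated_zone(zone)
			nr_dirty += zone_page_state_snapshot(zone, NR_FILE_DIRTY);

		/* fold the pre-existing global count into the root memcg */
		this_cpu_add(root->stat->count[MEM_CGROUP_STAT_FILE_DIRTY],
			     nr_dirty);
	}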

> Glauber has already done lots of work here; in his previous patchset he
> also tried to move some global stats to root (
> http://comments.gmane.org/gmane.linux.kernel.cgroups/6291). May I steal
> some of your ideas here, Glauber? :P
> 
> 
> >
> > Besides that, the current patch is racy. Consider what happens when:
> >
> > mem_cgroup_begin_update_page_stat
> >                                         arm_inuse_keys
> >								mem_cgroup_move_account
> > mem_cgroup_move_account_page_stat
> > mem_cgroup_end_update_page_stat
> >
> > The race window is small of course but it is there. I guess we need
> > rcu_read_lock at least.
> 
> Yes, you're right. I'm afraid we need to take care of the race in the next
> updates as well. But mem_cgroup_begin/end_update_page_stat() already take
> the rcu lock, so here maybe we only need a synchronize_rcu() after changing
> memcg_inuse_key?

Your patch doesn't take rcu_read_lock. synchronize_rcu might work, but I
am still not sure this would help to avoid the overhead, which IMHO comes
from the accounting itself rather than from the single atomic_read +
rcu_read_lock in the hot path of mem_cgroup_{begin,end}_update_page_stat.
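
For completeness, the synchronize_rcu() variant would look roughly like
this (sketch only, and it assumes the fast path keeps rcu_read_lock()
before testing the key, which the posted patch does not do):

	/*
	 * fast path sketch:
	 *
	 *	rcu_read_lock();
	 *	if (!mem_cgroup_in_use())
	 *		return;		(rcu_read_unlock() stays in _end)
	 */
	static void arm_inuse_keys(void)
	{
		static_key_slow_inc(&memcg_inuse_key);
		/*
		 * Wait for every task that sampled the key as off; after
		 * this, mem_cgroup_move_account() cannot overlap an
		 * unprotected page-stat update.
		 */
		synchronize_rcu();
	}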

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-12 13:25         ` Michal Hocko
  (?)
@ 2013-07-13  4:15         ` Sha Zhengju
  -1 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-13  4:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Cgroups, Wu Fengguang, KAMEZAWA Hiroyuki, Sha Zhengju,
	Mel Gorman, Greg Thelen, Glauber Costa, Andrew Morton

On 2013-7-12 at 9:25 PM, "Michal Hocko" <mhocko@suse.cz> wrote:
>
> On Fri 12-07-13 20:59:24, Sha Zhengju wrote:
> > Add cc to Glauber
> >
> > On Thu, Jul 11, 2013 at 10:56 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
> > >> From: Sha Zhengju <handai.szj@taobao.com>
> > >>
> > >> If memcg is enabled and no non-root memcg exists, all allocated
> > >> pages belong to root_mem_cgroup and will go through root memcg
> > >> statistics routines.  So in order to reduce overheads after adding
> > >> memcg dirty/writeback accounting in hot paths, we use jump label to
> > >> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
> > >> used.
> > >
> > > I do not think this is enough. How much do you save? One atomic read.
> > > This doesn't seem like a killer.
> > >
> > > I hoped we could simply not account at all and move counters to the root
> > > cgroup once the label gets enabled.
> >
> > I have thought of this approach before, but it would probably run into
> > another issue: each zone has a percpu stock named ->pageset to optimize
> > the increment and decrement operations, and I haven't figured out a
> > simpler and cheaper way to handle those stock numbers if we move the
> > global counters to the root cgroup. Maybe we can just leave them and
> > afford the approximation?
>
> You can read per-cpu diffs during transition and tolerate small
> races. Or maybe simply summing NR_FILE_DIRTY for all zones would be
> sufficient.

Thanks, I'll have a try.

>
> > Glauber has already done lots of work here; in his previous patchset he
> > also tried to move some global stats to root (
> > http://comments.gmane.org/gmane.linux.kernel.cgroups/6291). May I steal
> > some of your ideas here, Glauber? :P
> >
> >
> > >
> > > Besides that, the current patch is racy. Consider what happens when:
> > >
> > > mem_cgroup_begin_update_page_stat
> > >                                         arm_inuse_keys
> > >                                                               mem_cgroup_move_account
> > > mem_cgroup_move_account_page_stat
> > > mem_cgroup_end_update_page_stat
> > >
> > > The race window is small of course but it is there. I guess we need
> > > rcu_read_lock at least.
> >
> > Yes, you're right. I'm afraid we need to take care of the race in the next
> > updates as well. But mem_cgroup_begin/end_update_page_stat() already take
> > the rcu lock, so here maybe we only need a synchronize_rcu() after changing
> > memcg_inuse_key?
>
> Your patch doesn't take rcu_read_lock. synchronize_rcu might work, but I
> am still not sure this would help to avoid the overhead, which IMHO comes
> from the accounting itself rather than from the single atomic_read +
> rcu_read_lock in the hot path of mem_cgroup_{begin,end}_update_page_stat.

I mean I'll try to zero out the accounting overhead in the next version, but
the race will probably still occur in that case.

Thanks!

>
> [...]
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
@ 2013-07-15  6:32           ` Glauber Costa
  0 siblings, 0 replies; 26+ messages in thread
From: Glauber Costa @ 2013-07-15  6:32 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, linux-mm, Cgroups, Greg Thelen, KAMEZAWA Hiroyuki,
	Andrew Morton, Wu Fengguang, Mel Gorman, Sha Zhengju

On Fri, Jul 12, 2013 at 09:13:56PM +0800, Sha Zhengju wrote:
> Oops... that address seems unreachable; changing Glauber's email...
> 
> 
> On Fri, Jul 12, 2013 at 8:59 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> 
> > Add cc to Glauber
> >
Thanks

> >
> > On Thu, Jul 11, 2013 at 10:56 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
> > >> From: Sha Zhengju <handai.szj@taobao.com>
> > >>
> > >> If memcg is enabled and no non-root memcg exists, all allocated
> > >> pages belong to root_mem_cgroup and will go through root memcg
> > >> statistics routines.  So in order to reduce overheads after adding
> > >> memcg dirty/writeback accounting in hot paths, we use jump label to
> > >> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
> > >> used.
> > >
> > > I do not think this is enough. How much do you save? One atomic read.
> > > This doesn't seem like a killer.
> > >
> > > I hoped we could simply not account at all and move counters to the root
> > > cgroup once the label gets enabled.
> >
> > I have thought of this approach before, but it would probably run into
> > another issue: each zone has a percpu stock named ->pageset to optimize
> > the increment and decrement operations, and I haven't figured out a
> > simpler and cheaper way to handle those stock numbers if we move the
> > global counters to the root cgroup. Maybe we can just leave them and
> > afford the approximation?
> >
> > Glauber has already done lots of work here; in his previous patchset he
> > also tried to move some global stats to root (
> > http://comments.gmane.org/gmane.linux.kernel.cgroups/6291). May I steal
> > some of your ideas here, Glauber? :P
> >
Of course. Please go ahead and keep me posted at my new address.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-11 14:56   ` Michal Hocko
  2013-07-12 12:59     ` Sha Zhengju
@ 2013-07-15 17:58     ` Greg Thelen
  2013-07-16  4:26       ` Sha Zhengju
  1 sibling, 1 reply; 26+ messages in thread
From: Greg Thelen @ 2013-07-15 17:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sha Zhengju, linux-mm, cgroups, KAMEZAWA Hiroyuki, Andrew Morton,
	Wu Fengguang, Mel Gorman, Sha Zhengju

On Thu, Jul 11, 2013 at 7:56 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> If memcg is enabled and no non-root memcg exists, all allocated
>> pages belong to root_mem_cgroup and will go through root memcg
>> statistics routines.  So in order to reduce overheads after adding
>> memcg dirty/writeback accounting in hot paths, we use jump label to
>> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
>> used.
>
> I do not think this is enough. How much do you save? One atomic read.
> This doesn't seem like a killer.

Given we're already using mem_cgroup_{begin,end}_update_page_stat(),
this optimization seems independent of memcg dirty/writeback
accounting.  Does this patch help memcg even before dirty/writeback
accounting?  If yes, then we have the option of splitting this
optimization out of the series.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists
  2013-07-15 17:58     ` Greg Thelen
@ 2013-07-16  4:26       ` Sha Zhengju
  0 siblings, 0 replies; 26+ messages in thread
From: Sha Zhengju @ 2013-07-16  4:26 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, linux-mm, Cgroups, KAMEZAWA Hiroyuki,
	Andrew Morton, Wu Fengguang, Mel Gorman, Sha Zhengju

On Tue, Jul 16, 2013 at 1:58 AM, Greg Thelen <gthelen@google.com> wrote:
> On Thu, Jul 11, 2013 at 7:56 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> On Sat 06-07-13 01:33:43, Sha Zhengju wrote:
>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>
>>> If memcg is enabled and no non-root memcg exists, all allocated
>>> pages belong to root_mem_cgroup and will go through root memcg
>>> statistics routines.  So in order to reduce overheads after adding
>>> memcg dirty/writeback accounting in hot paths, we use jump label to
>>> patch mem_cgroup_{begin,end}_update_page_stat() in or out when not
>>> used.
>>
>> I do not think this is enough. How much do you save? One atomic read.
>> This doesn't seem like a killer.
>
> Given we're already using mem_cgroup_{begin,end}_update_page_stat(),
> this optimization seems independent of memcg dirty/writeback
> accounting.  Does this patch help memcg even before dirty/writeback
> accounting?  If yes, then we have the option of splitting this
> optimization out of the series.

set_page_dirty() is a hot path; people said I should be careful about the
overhead of adding new accounting, and that the optimization is a must
before merging.
But since we have more need of this feature now, if it's blocking
something I'm willing to split it out.

--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-07-16  4:26 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-05 17:18 [PATCH V4 0/6] Memcg dirty/writeback page accounting Sha Zhengju
2013-07-05 17:21 ` [PATCH V4 1/6] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2013-07-11 13:39   ` Michal Hocko
2013-07-11 15:53     ` Sha Zhengju
2013-07-05 17:26 ` [PATCH V4 2/6] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
     [not found] ` <1373044710-27371-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2013-07-05 17:30   ` [PATCH V4 3/6] memcg: add per cgroup dirty pages accounting Sha Zhengju
2013-07-05 17:30     ` Sha Zhengju
2013-07-11 14:31     ` Michal Hocko
2013-07-11 16:49       ` Sha Zhengju
2013-07-05 17:32 ` [PATCH V4 4/6] memcg: add per cgroup writeback " Sha Zhengju
2013-07-11 14:40   ` Michal Hocko
2013-07-11 14:40     ` Michal Hocko
     [not found]     ` <20130711144003.GJ21667-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-07-11 16:56       ` Sha Zhengju
2013-07-11 16:56         ` Sha Zhengju
2013-07-05 17:33 ` [PATCH V4 5/6] memcg: patch mem_cgroup_{begin,end}_update_page_stat() out if only root memcg exists Sha Zhengju
2013-07-11 14:56   ` Michal Hocko
2013-07-12 12:59     ` Sha Zhengju
2013-07-12 13:13       ` Sha Zhengju
2013-07-15  6:32         ` Glauber Costa
2013-07-15  6:32           ` Glauber Costa
2013-07-12 13:25       ` Michal Hocko
2013-07-12 13:25         ` Michal Hocko
2013-07-13  4:15         ` Sha Zhengju
2013-07-15 17:58     ` Greg Thelen
2013-07-16  4:26       ` Sha Zhengju
2013-07-05 17:34 ` [PATCH V4 6/6] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.