linux-fsdevel.vger.kernel.org archive mirror
* [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
       [not found] <1375357402-9811-1-git-send-email-handai.szj@taobao.com>
@ 2013-08-01 11:51 ` Sha Zhengju
  2013-08-01 15:19   ` Yan, Zheng
  2013-08-01 11:53 ` [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
  2 siblings, 1 reply; 21+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:51 UTC (permalink / raw)
  To: linux-fsdevel, ceph-devel, linux-mm, cgroups
  Cc: sage, mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu,
	akpm, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

In the following patches we will add memcg dirty page accounting around
__set_page_dirty_{buffers,nobuffers} in the VFS layer, so we'd better use the
VFS interface to avoid exporting those details to filesystems.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/ceph/addr.c |   13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..1445bf1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page)) {
+	if (!__set_page_dirty_nobuffers(page)) {
 		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 		     mapping->host, page, page->index);
 		return 0;
@@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting
       [not found] <1375357402-9811-1-git-send-email-handai.szj@taobao.com>
  2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
@ 2013-08-01 11:53 ` Sha Zhengju
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
  2 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:53 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"),
we can use 'struct page' flags to test page state instead of per-page_cgroup
flags. But memcg has a feature to move a page from one cgroup to another, and
the "move" may race with "page stat accounting". In order to avoid that race
we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (the dirty page accounting) to be protected by
mem_cgroup_{begin/end}_update_page_stat(). The lock is a full no-op for
!CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an rcu
read lock in most cases (no task is moving), and spin_lock_irqsave on top in
the slow path.

Accounting needs to be added in the following places:
        incrementing (3):
                __set_page_dirty_buffers
                __set_page_dirty_nobuffers
		mark_buffer_dirty
        decrementing (7):
                clear_page_dirty_for_io
                cancel_dirty_page
		delete_from_page_cache
		__delete_from_page_cache
		replace_page_cache_page
		invalidate_complete_page2
		__remove_mapping

The lock order between memcg lock and mapping lock is:
	--> memcg->move_lock
	  --> mapping->private_lock
            --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c                |   13 +++++++++++++
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   17 ++++++++++++++++-
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   24 ++++++++++++++++++++++--
 mm/truncate.c              |   12 ++++++++++++
 mm/vmscan.c                |    7 +++++++
 7 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 695eb14..2d92d60 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -694,10 +694,17 @@ int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
+	/* The main concern is page stat accounting; the lock protects
+	 * both SetPageDirty and the accounting, so the !mapping check
+	 * above can stay outside it.
+	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock(&mapping->private_lock);
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -713,6 +720,7 @@ int __set_page_dirty_buffers(struct page *page)
 
 	if (newly_dirty)
 		__set_page_dirty(page, mapping, 1);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1169,11 +1177,16 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		bool locked;
+		unsigned long flags;
+
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (!TestSetPageDirty(page)) {
 			struct address_space *mapping = page_mapping(page);
 			if (mapping)
 				__set_page_dirty(page, mapping, 0);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d166aeb..f952be6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index a6981fe..690e4d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -65,6 +65,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *  ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *						move_lock_mem_cgroup)
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *        ->mapping->tree_lock
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -110,7 +115,8 @@
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold the mapping's tree_lock.
+ * is safe.  The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin/end_update_page_stat() lock.
  */
 void __delete_from_page_cache(struct page *page)
 {
@@ -144,6 +150,7 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
@@ -161,13 +168,17 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	bool locked;
+	unsigned long flags;
 
 	BUG_ON(!PageLocked(page));
 
 	freepage = mapping->a_ops->freepage;
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
@@ -417,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		bool locked;
+		unsigned long flags;
 
 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
@@ -425,6 +438,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
+		mem_cgroup_begin_update_page_stat(old, &locked, &flags);
 		spin_lock_irq(&mapping->tree_lock);
 		__delete_from_page_cache(old);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
@@ -434,6 +448,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(old, &locked, &flags);
 		/* mem_cgroup codes must not be called under tree_lock */
 		mem_cgroup_replace_page_cache(old, new);
 		radix_tree_preload_end();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a55d46..8f3e514 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,6 +90,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"dirty",
 	"swap",
 };
 
@@ -3744,6 +3745,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3789,13 +3804,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (!anon && PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..a09f518 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1982,6 +1982,11 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while calling this function.
+ * See __set_page_dirty_{nobuffers,buffers} for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
@@ -1989,6 +1994,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 	trace_writeback_dirty_page(page, mapping);
 
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2028,6 +2034,11 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
@@ -2045,12 +2056,15 @@ int __set_page_dirty_nobuffers(struct page *page)
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2166,6 +2180,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2207,13 +2224,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..ed41636 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
@@ -386,12 +392,16 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
 static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (page->mapping != mapping)
 		return 0;
 
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
@@ -399,6 +409,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
@@ -408,6 +419,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	return 1;
 failed:
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e364542..ba1ad4a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -486,9 +486,13 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -527,6 +531,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
@@ -535,6 +540,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		__delete_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
@@ -545,6 +551,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 cannot_free:
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return 0;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
       [not found] <1375357402-9811-1-git-send-email-handai.szj@taobao.com>
  2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
  2013-08-01 11:53 ` [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
@ 2013-08-01 11:54 ` Sha Zhengju
  2013-08-01 14:53   ` Michal Hocko
  2013-08-04 18:51   ` [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting Greg Thelen
  2 siblings, 2 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-01 11:54 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, cgroups
  Cc: mhocko, kamezawa.hiroyu, glommer, gthelen, fengguang.wu, akpm,
	Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Similar to dirty page accounting, we add per-cgroup writeback page accounting.
The locking rule is still:
        mem_cgroup_begin_update_page_stat()
        modify page WRITEBACK stat
        mem_cgroup_update_page_stat()
        mem_cgroup_end_update_page_stat()

There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
Lock order:
	--> memcg->move_lock
	  --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |    5 +++++
 mm/page-writeback.c        |   15 +++++++++++++++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f952be6..ccd35d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
 	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f3e514..6c18a6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss_huge",
 	"mapped_file",
 	"dirty",
+	"writeback",
 	"swap",
 };
 
@@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 			MEM_CGROUP_STAT_FILE_DIRTY);
 
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a09f518..2fa6a52 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while calling this function.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	return ret;
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
@ 2013-08-01 14:53   ` Michal Hocko
       [not found]     ` <20130801145302.GJ5198-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2013-08-04 18:51   ` [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting Greg Thelen
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-08-01 14:53 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, kamezawa.hiroyu, glommer,
	gthelen, fengguang.wu, akpm, Sha Zhengju

On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Similar to dirty page accounting, we add per-cgroup writeback page accounting.
> The locking rule is still:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
> 
> There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
> Lock order:
> 	--> memcg->move_lock
> 	  --> mapping->tree_lock
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

Looks good to me. Maybe I would suggest moving this patch up the stack
so that it might get merged earlier as it is simpler than dirty pages
accounting. Unless you insist on having the full series merged at once.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   15 +++++++++++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f952be6..ccd35d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8f3e514..6c18a6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss_huge",
>  	"mapped_file",
>  	"dirty",
> +	"writeback",
>  	"swap",
>  };
>  
> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a09f518..2fa6a52 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while calling this function.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
@ 2013-08-01 15:19   ` Yan, Zheng
  2013-08-01 18:27     ` Sage Weil
  2013-08-02  9:04     ` Sha Zhengju
  0 siblings, 2 replies; 21+ messages in thread
From: Yan, Zheng @ 2013-08-01 15:19 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, ceph-devel, linux-mm, cgroups, Sage Weil, mhocko,
	kamezawa.hiroyu, glommer, Greg Thelen, Wu Fengguang,
	Andrew Morton, Sha Zhengju


On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> In the following patches we will add memcg dirty page accounting around
> __set_page_dirty_{buffers,nobuffers} in the VFS layer, so we'd better use the
> VFS interface to avoid exporting those details to filesystems.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/ceph/addr.c |   13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 3e68ac1..1445bf1 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>         if (unlikely(!mapping))
>                 return !TestSetPageDirty(page);
>
> -       if (TestSetPageDirty(page)) {
> +       if (!__set_page_dirty_nobuffers(page)) {

It's too early to set the radix tree tag here. We should set the page's
snapshot context and increase i_wrbuffer_ref first, because once the tag is
set, the writeback thread can find the page and start flushing it.


>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>                      mapping->host, page, page->index);
>                 return 0;
> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
>              snapc, snapc->seq, snapc->num_snaps);
>         spin_unlock(&ci->i_ceph_lock);
>
> -       /* now adjust page */
> -       spin_lock_irq(&mapping->tree_lock);
>         if (page->mapping) {    /* Race with truncate? */
> -               WARN_ON_ONCE(!PageUptodate(page));
> -               account_page_dirtied(page, page->mapping);
> -               radix_tree_tag_set(&mapping->page_tree,
> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> -

This code was copied from __set_page_dirty_nobuffers(). I think the reason
Sage did this is to handle the race described in __set_page_dirty_nobuffers()'s
comment. But I wonder if "page->mapping == NULL" can still happen here, because
truncate_inode_page() unmaps the page from processes' address spaces first,
then deletes it from the page cache.

Regards
Yan, Zheng

>                 /*
>                  * Reference snap context in page->private.  Also set
>                  * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
>                 undo = 1;
>         }
>
> -       spin_unlock_irq(&mapping->tree_lock);

> -
>         if (undo)
>                 /* whoops, we failed to dirty the page */
>                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>
> -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
>         BUG_ON(!PageDirty(page));
>         return 1;
>  }
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 15:19   ` Yan, Zheng
@ 2013-08-01 18:27     ` Sage Weil
  2013-08-02 10:04       ` Sha Zhengju
  2013-08-02  9:04     ` Sha Zhengju
  1 sibling, 1 reply; 21+ messages in thread
From: Sage Weil @ 2013-08-01 18:27 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: Sha Zhengju, linux-fsdevel, ceph-devel, linux-mm, cgroups,
	mhocko, kamezawa.hiroyu, glommer, Greg Thelen, Wu Fengguang,
	Andrew Morton, Sha Zhengju


On Thu, 1 Aug 2013, Yan, Zheng wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> >
> > Following we will begin to add memcg dirty page accounting around
> __set_page_dirty_
> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> avoid exporting
> > those details to filesystems.
> >
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> > ---
> >  fs/ceph/addr.c |   13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 3e68ac1..1445bf1 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >         if (unlikely(!mapping))
> >                 return !TestSetPageDirty(page);
> >
> > -       if (TestSetPageDirty(page)) {
> > +       if (!__set_page_dirty_nobuffers(page)) {
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

Unfortunately I only remember being frustrated by this code.  :)  Looking 
at it now, though, it seems like the minimum fix is to set the 
page->private before marking the page dirty.  I don't know the locking 
rules around that, though.  If that is potentially racy, maybe the safest 
thing would be if __set_page_dirty_nobuffers() took a void* to set 
page->private to atomically while holding the tree_lock.

sage

> 
> >                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >                      mapping->host, page, page->index);
> >                 return 0;
> > @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >              snapc, snapc->seq, snapc->num_snaps);
> >         spin_unlock(&ci->i_ceph_lock);
> >
> > -       /* now adjust page */
> > -       spin_lock_irq(&mapping->tree_lock);
> >         if (page->mapping) {    /* Race with truncate? */
> > -               WARN_ON_ONCE(!PageUptodate(page));
> > -               account_page_dirtied(page, page->mapping);
> > -               radix_tree_tag_set(&mapping->page_tree,
> > -                               page_index(page), PAGECACHE_TAG_DIRTY);
> > -
> 
> this code was coped from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.
> 
> Regards
> Yan, Zheng
> 
> >                 /*
> >                  * Reference snap context in page->private.  Also set
> >                  * PagePrivate so that we get invalidatepage callback.
> > @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
> >                 undo = 1;
> >         }
> >
> > -       spin_unlock_irq(&mapping->tree_lock);
> 
> 
> 
> 
> > -
> >         if (undo)
> >                 /* whoops, we failed to dirty the page */
> >                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> >
> > -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > -
> >         BUG_ON(!PageDirty(page));
> >         return 1;
> >  }
> > --
> > 1.7.9.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 15:19   ` Yan, Zheng
  2013-08-01 18:27     ` Sage Weil
@ 2013-08-02  9:04     ` Sha Zhengju
  2013-08-02 13:11       ` Yan, Zheng
  1 sibling, 1 reply; 21+ messages in thread
From: Sha Zhengju @ 2013-08-02  9:04 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: linux-fsdevel, ceph-devel, linux-mm, Cgroups, Sage Weil,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Thu, Aug 1, 2013 at 11:19 PM, Yan, Zheng <ukernel@gmail.com> wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Following we will begin to add memcg dirty page accounting around
>> __set_page_dirty_
>> {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> avoid exporting
>> those details to filesystems.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>>  fs/ceph/addr.c |   13 +------------
>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 3e68ac1..1445bf1 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>>         if (unlikely(!mapping))
>>                 return !TestSetPageDirty(page);
>>
>> -       if (TestSetPageDirty(page)) {
>> +       if (!__set_page_dirty_nobuffers(page)) {
>
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

OK, thanks for pointing it out.

>
>
>>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>>                      mapping->host, page, page->index);
>>                 return 0;
>> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
>>              snapc, snapc->seq, snapc->num_snaps);
>>         spin_unlock(&ci->i_ceph_lock);
>>
>> -       /* now adjust page */
>> -       spin_lock_irq(&mapping->tree_lock);
>>         if (page->mapping) {    /* Race with truncate? */
>> -               WARN_ON_ONCE(!PageUptodate(page));
>> -               account_page_dirtied(page, page->mapping);
>> -               radix_tree_tag_set(&mapping->page_tree,
>> -                               page_index(page), PAGECACHE_TAG_DIRTY);
>> -
>
> this code was coped from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.

But in the non-mmap case, isn't this unrelated to 'unmap page from
address spaces'?
The check is exactly there to avoid a race with delete_from_page_cache():
both paths need to hold mapping->tree_lock, so if truncate goes first then
__set_page_dirty_nobuffers() may see a NULL mapping.
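
A toy sketch of that serialization, with a hypothetical flag standing in
for mapping->tree_lock in a single-threaded model: whichever side takes the
lock first wins, and the dirtying side must re-check page->mapping under
the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: the "tree_lock" is a flag that
 * asserts mutual exclusion in this single-threaded sketch. */
struct toy_mapping { bool tree_lock_held; };

static void toy_tree_lock(struct toy_mapping *m)
{
	assert(!m->tree_lock_held);
	m->tree_lock_held = true;
}

static void toy_tree_unlock(struct toy_mapping *m)
{
	assert(m->tree_lock_held);
	m->tree_lock_held = false;
}

struct toy_page {
	struct toy_mapping *mapping;   /* NULL once truncated */
	bool dirty_tag;
};

/* Truncate side: deletes the page from its mapping under tree_lock. */
static void toy_delete_from_page_cache(struct toy_page *p,
				       struct toy_mapping *m)
{
	toy_tree_lock(m);
	p->mapping = NULL;
	toy_tree_unlock(m);
}

/* Dirty side: re-checks page->mapping under the same lock; if truncate
 * won the race, the page is not tagged dirty. Returns whether the page
 * was actually dirtied. */
static bool toy_set_dirty(struct toy_page *p, struct toy_mapping *m)
{
	bool dirtied = false;

	toy_tree_lock(m);
	if (p->mapping) {	/* race with truncate? */
		p->dirty_tag = true;
		dirtied = true;
	}
	toy_tree_unlock(m);
	return dirtied;
}
```

If truncate runs first, the NULL-mapping check makes the dirty path back
off, which is exactly the "Race with truncate?" branch in the patch.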


Thanks,
Sha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-01 18:27     ` Sage Weil
@ 2013-08-02 10:04       ` Sha Zhengju
       [not found]         ` <CAFj3OHVXvtr5BDMrGatHZi7M9y+dh1ZKRMQZGjZmNBcg3pNQtw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Sha Zhengju @ 2013-08-02 10:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: Yan, Zheng, linux-fsdevel, ceph-devel, linux-mm, Cgroups,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 1 Aug 2013, Yan, Zheng wrote:
>> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> > From: Sha Zhengju <handai.szj@taobao.com>
>> >
>> > Following we will begin to add memcg dirty page accounting around
>> __set_page_dirty_
>> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> avoid exporting
>> > those details to filesystems.
>> >
>> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> > ---
>> >  fs/ceph/addr.c |   13 +------------
>> >  1 file changed, 1 insertion(+), 12 deletions(-)
>> >
>> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> > index 3e68ac1..1445bf1 100644
>> > --- a/fs/ceph/addr.c
>> > +++ b/fs/ceph/addr.c
>> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>> >         if (unlikely(!mapping))
>> >                 return !TestSetPageDirty(page);
>> >
>> > -       if (TestSetPageDirty(page)) {
>> > +       if (!__set_page_dirty_nobuffers(page)) {
>> it's too early to set the radix tree tag here. We should set page's snapshot
>> context and increase the i_wrbuffer_ref first. This is because once the tag
>> is set, writeback thread can find and start flushing the page.
>
> Unfortunately I only remember being frustrated by this code.  :)  Looking
> at it now, though, it seems like the minimum fix is to set the
> page->private before marking the page dirty.  I don't know the locking
> rules around that, though.  If that is potentially racy, maybe the safest
> thing would be if __set_page_dirty_nobuffers() took a void* to set
> page->private to atomically while holding the tree_lock.
>

Sorry, I don't quite get the point of your last sentence... Could you
please explain it again?

I notice there is a check in __set_page_dirty_nobuffers():
      WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
So does it mean we can only set page->private after that point? But if so,
__mark_inode_dirty still runs before the snapc is set.


Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-02  9:04     ` Sha Zhengju
@ 2013-08-02 13:11       ` Yan, Zheng
  0 siblings, 0 replies; 21+ messages in thread
From: Yan, Zheng @ 2013-08-02 13:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, ceph-devel, linux-mm, Cgroups, Sage Weil,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Fri, Aug 2, 2013 at 5:04 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>
> On Thu, Aug 1, 2013 at 11:19 PM, Yan, Zheng <ukernel@gmail.com> wrote:
> > On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Following we will begin to add memcg dirty page accounting around
> >> __set_page_dirty_
> >> {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> >> avoid exporting
> >> those details to filesystems.
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >> ---
> >>  fs/ceph/addr.c |   13 +------------
> >>  1 file changed, 1 insertion(+), 12 deletions(-)
> >>
> >> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> >> index 3e68ac1..1445bf1 100644
> >> --- a/fs/ceph/addr.c
> >> +++ b/fs/ceph/addr.c
> >> @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >>         if (unlikely(!mapping))
> >>                 return !TestSetPageDirty(page);
> >>
> >> -       if (TestSetPageDirty(page)) {
> >> +       if (!__set_page_dirty_nobuffers(page)) {
> >
> > it's too early to set the radix tree tag here. We should set page's snapshot
> > context and increase the i_wrbuffer_ref first. This is because once the tag
> > is set, writeback thread can find and start flushing the page.
>
> OK, thanks for pointing it out.
>
> >
> >
> >>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >>                      mapping->host, page, page->index);
> >>                 return 0;
> >> @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >>              snapc, snapc->seq, snapc->num_snaps);
> >>         spin_unlock(&ci->i_ceph_lock);
> >>
> >> -       /* now adjust page */
> >> -       spin_lock_irq(&mapping->tree_lock);
> >>         if (page->mapping) {    /* Race with truncate? */
> >> -               WARN_ON_ONCE(!PageUptodate(page));
> >> -               account_page_dirtied(page, page->mapping);
> >> -               radix_tree_tag_set(&mapping->page_tree,
> >> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> >> -
> >
> > this code was coped from __set_page_dirty_nobuffers(). I think the reason
> > Sage did this is to handle the race described in
> > __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> > NULL" can still happen here. Because truncate_inode_page() unmap page from
> > processes's address spaces first, then delete page from page cache.
>
> But in non-mmap case, doesn't it has no relation to 'unmap page from
> address spaces'?

In the non-mmap case, the page is locked when the set_page_dirty() callback
is called. truncate_inode_page() waits until the page is unlocked, then
deletes it from the page cache.

Regards
Yan, Zheng

> The check is exactly avoiding racy with delete_from_page_cache(),
> since the two both need to hold mapping->tree_lock, and if truncate
> goes first then __set_page_dirty_nobuffers() may have NULL mapping.
>
>
> Thanks,
> Sha


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
       [not found]         ` <CAFj3OHVXvtr5BDMrGatHZi7M9y+dh1ZKRMQZGjZmNBcg3pNQtw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-08-02 20:30           ` Sage Weil
  2013-08-03  8:58             ` Sha Zhengju
  0 siblings, 1 reply; 21+ messages in thread
From: Sage Weil @ 2013-08-02 20:30 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Yan, Zheng, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, ceph-devel,
	linux-mm, Cgroups, Michal Hocko, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Fri, 2 Aug 2013, Sha Zhengju wrote:
> On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
> > On Thu, 1 Aug 2013, Yan, Zheng wrote:
> >> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> >> >
> >> > Following we will begin to add memcg dirty page accounting around
> >> __set_page_dirty_
> >> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> >> avoid exporting
> >> > those details to filesystems.
> >> >
> >> > Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> >> > ---
> >> >  fs/ceph/addr.c |   13 +------------
> >> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >> >
> >> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> >> > index 3e68ac1..1445bf1 100644
> >> > --- a/fs/ceph/addr.c
> >> > +++ b/fs/ceph/addr.c
> >> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >> >         if (unlikely(!mapping))
> >> >                 return !TestSetPageDirty(page);
> >> >
> >> > -       if (TestSetPageDirty(page)) {
> >> > +       if (!__set_page_dirty_nobuffers(page)) {
> >> it's too early to set the radix tree tag here. We should set page's snapshot
> >> context and increase the i_wrbuffer_ref first. This is because once the tag
> >> is set, writeback thread can find and start flushing the page.
> >
> > Unfortunately I only remember being frustrated by this code.  :)  Looking
> > at it now, though, it seems like the minimum fix is to set the
> > page->private before marking the page dirty.  I don't know the locking
> > rules around that, though.  If that is potentially racy, maybe the safest
> > thing would be if __set_page_dirty_nobuffers() took a void* to set
> > page->private to atomically while holding the tree_lock.
> >
> 
> Sorry, I don't catch the point of your last sentence... Could you
> please explain it again?

It didn't make much sense.  :)  I was worried about multiple callers to
set_page_dirty, but as I understand it, this all happens under the page
lock, right?  (There is a mention of other special cases in
mm/page-writeback.c, but I'm hoping we don't need to worry about those.)

In any case, I suspect what we actually want is something like the below 
(untested) patch.  The snapc accounting can be ignored here because 
invalidatepage will clean it up...

sage



diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index afb2fc2..7602e46 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,9 +76,10 @@ static int ceph_set_page_dirty(struct page *page)
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page)) {
+	if (PageDirty(page)) {
 		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 		     mapping->host, page, page->index);
+		BUG_ON(!PagePrivate(page));
 		return 0;
 	}
 
@@ -107,35 +108,16 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
-	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
-		/*
-		 * Reference snap context in page->private.  Also set
-		 * PagePrivate so that we get invalidatepage callback.
-		 */
-		page->private = (unsigned long)snapc;
-		SetPagePrivate(page);
-	} else {
-		dout("ANON set_page_dirty %p (raced truncate?)\n", page);
-		undo = 1;
-	}
-
-	spin_unlock_irq(&mapping->tree_lock);
-
-	if (undo)
-		/* whoops, we failed to dirty the page */
-		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
-
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	/*
+	 * Reference snap context in page->private.  Also set
+	 * PagePrivate so that we get invalidatepage callback.
+	 */
+	BUG_ON(PagePrivate(page));
+	page->private = (unsigned long)snapc;
+	SetPagePrivate(page);
 
-	BUG_ON(!PageDirty(page));
-	return 1;
+	return __set_page_dirty_nobuffers(page);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
  2013-08-02 20:30           ` Sage Weil
@ 2013-08-03  8:58             ` Sha Zhengju
  0 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-03  8:58 UTC (permalink / raw)
  To: Sage Weil
  Cc: Yan, Zheng, linux-fsdevel, ceph-devel, linux-mm, Cgroups,
	Michal Hocko, KAMEZAWA Hiroyuki, Glauber Costa, Greg Thelen,
	Wu Fengguang, Andrew Morton, Sha Zhengju

On Sat, Aug 3, 2013 at 4:30 AM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 2 Aug 2013, Sha Zhengju wrote:
>> On Fri, Aug 2, 2013 at 2:27 AM, Sage Weil <sage@inktank.com> wrote:
>> > On Thu, 1 Aug 2013, Yan, Zheng wrote:
>> >> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju <handai.szj@gmail.com> wrote:
>> >> > From: Sha Zhengju <handai.szj@taobao.com>
>> >> >
>> >> > Following we will begin to add memcg dirty page accounting around
>> >> __set_page_dirty_
>> >> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
>> >> avoid exporting
>> >> > those details to filesystems.
>> >> >
>> >> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> >> > ---
>> >> >  fs/ceph/addr.c |   13 +------------
>> >> >  1 file changed, 1 insertion(+), 12 deletions(-)
>> >> >
>> >> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> >> > index 3e68ac1..1445bf1 100644
>> >> > --- a/fs/ceph/addr.c
>> >> > +++ b/fs/ceph/addr.c
>> >> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
>> >> >         if (unlikely(!mapping))
>> >> >                 return !TestSetPageDirty(page);
>> >> >
>> >> > -       if (TestSetPageDirty(page)) {
>> >> > +       if (!__set_page_dirty_nobuffers(page)) {
>> >> it's too early to set the radix tree tag here. We should set page's snapshot
>> >> context and increase the i_wrbuffer_ref first. This is because once the tag
>> >> is set, writeback thread can find and start flushing the page.
>> >
>> > Unfortunately I only remember being frustrated by this code.  :)  Looking
>> > at it now, though, it seems like the minimum fix is to set the
>> > page->private before marking the page dirty.  I don't know the locking
>> > rules around that, though.  If that is potentially racy, maybe the safest
>> > thing would be if __set_page_dirty_nobuffers() took a void* to set
>> > page->private to atomically while holding the tree_lock.
>> >
>>
>> Sorry, I don't catch the point of your last sentence... Could you
>> please explain it again?
>
> It didn't make much sense.  :)  I was worried about multiple callers to
> set_page_dirty, but as understand it, this all happens under page->lock,
> right?  (There is a mention of other special cases in mm/page-writeback.c,
> but I'm hoping we don't need to worry about that.)

I agree, page lock can handle the concurrent access.

>
> In any case, I suspect what we actually want is something like the below
> (untested) patch.  The snapc accounting can be ignored here because
> invalidatepage will clean it up...
>
> sage
>
>
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index afb2fc2..7602e46 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -76,9 +76,10 @@ static int ceph_set_page_dirty(struct page *page)
>         if (unlikely(!mapping))
>                 return !TestSetPageDirty(page);
>
> -       if (TestSetPageDirty(page)) {
> +       if (PageDirty(page)) {
>                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>                      mapping->host, page, page->index);
> +               BUG_ON(!PagePrivate(page));
>                 return 0;
>         }
>
> @@ -107,35 +108,16 @@ static int ceph_set_page_dirty(struct page *page)
>              snapc, snapc->seq, snapc->num_snaps);
>         spin_unlock(&ci->i_ceph_lock);
>
> -       /* now adjust page */
> -       spin_lock_irq(&mapping->tree_lock);
> -       if (page->mapping) {    /* Race with truncate? */
> -               WARN_ON_ONCE(!PageUptodate(page));
> -               account_page_dirtied(page, page->mapping);
> -               radix_tree_tag_set(&mapping->page_tree,
> -                               page_index(page), PAGECACHE_TAG_DIRTY);
> -
> -               /*
> -                * Reference snap context in page->private.  Also set
> -                * PagePrivate so that we get invalidatepage callback.
> -                */
> -               page->private = (unsigned long)snapc;
> -               SetPagePrivate(page);
> -       } else {
> -               dout("ANON set_page_dirty %p (raced truncate?)\n", page);
> -               undo = 1;
> -       }
> -
> -       spin_unlock_irq(&mapping->tree_lock);
> -
> -       if (undo)
> -               /* whoops, we failed to dirty the page */
> -               ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> -
> -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +       /*
> +        * Reference snap context in page->private.  Also set
> +        * PagePrivate so that we get invalidatepage callback.
> +        */
> +       BUG_ON(PagePrivate(page));
> +       page->private = (unsigned long)snapc;
> +       SetPagePrivate(page);
>
> -       BUG_ON(!PageDirty(page));
> -       return 1;
> +       return __set_page_dirty_nobuffers(page);
>  }
>
>  /*

Looks good. Since the page lock prevents concurrent access, the undo logic
is no longer necessary either. Thank you very much!

-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
       [not found]     ` <20130801145302.GJ5198-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-08-03  9:25       ` Sha Zhengju
       [not found]         ` <CAFj3OHV-VCKJfe6bv4UMvv+uj4LELDXsieRZFJD06Yrdyy=XxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Sha Zhengju @ 2013-08-03  9:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Cgroups, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>
>> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
>> rule still is:
>>         mem_cgroup_begin_update_page_stat()
>>         modify page WRITEBACK stat
>>         mem_cgroup_update_page_stat()
>>         mem_cgroup_end_update_page_stat()
>>
>> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
>> Lock order:
>>       --> memcg->move_lock
>>         --> mapping->tree_lock
>>
>> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> Looks good to me. Maybe I would suggest moving this patch up the stack
> so that it might get merged earlier as it is simpler than dirty pages
> accounting. Unless you insist on having the full series merged at once.

I think the following three patches can be merged earlier:
      1/8 memcg: remove MEMCG_NR_FILE_MAPPED
      3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
      5/8 memcg: add per cgroup writeback pages accounting

Do I need to resend them for you, or are they fine as they are?

One more thing: since dirty accounting is essential to future memcg
dirty page throttling and is not an optional feature now, I wonder
whether we can merge the following two as well and leave the overhead
optimization to a separate series.  :p
      4/8 memcg: add per cgroup dirty pages accounting
      8/8 memcg: Document cgroup dirty/writeback memory statistics

The 2/8 ceph patch still needs more improvement; I'll separate it out in
the next version.

>
> Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

Thank you.

>
>> ---
>>  include/linux/memcontrol.h |    1 +
>>  mm/memcontrol.c            |    5 +++++
>>  mm/page-writeback.c        |   15 +++++++++++++++
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index f952be6..ccd35d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_RSS_HUGE,       /* # of pages charged as anon huge */
>>       MEM_CGROUP_STAT_FILE_MAPPED,    /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_FILE_DIRTY,     /* # of dirty pages in page cache */
>> +     MEM_CGROUP_STAT_WRITEBACK,      /* # of pages under writeback */
>>       MEM_CGROUP_STAT_SWAP,           /* # of pages, swapped out */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8f3e514..6c18a6d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "rss_huge",
>>       "mapped_file",
>>       "dirty",
>> +     "writeback",
>>       "swap",
>>  };
>>
>> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>>               mem_cgroup_move_account_page_stat(from, to, nr_pages,
>>                       MEM_CGROUP_STAT_FILE_DIRTY);
>>
>> +     if (PageWriteback(page))
>> +             mem_cgroup_move_account_page_stat(from, to, nr_pages,
>> +                     MEM_CGROUP_STAT_WRITEBACK);
>> +
>>       mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>>
>>       /* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index a09f518..2fa6a52 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>>
>>  /*
>>   * Helper function for set_page_writeback family.
>> + *
>> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
>> + * while calling this function.
>> + * See test_set_page_writeback for example.
>> + *
>>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>>   * wrt interrupts.
>>   */
>>  void account_page_writeback(struct page *page)
>>  {
>> +     mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>       inc_zone_page_state(page, NR_WRITEBACK);
>>  }
>>  EXPORT_SYMBOL(account_page_writeback);
>> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long memcg_flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>>               ret = TestClearPageWriteback(page);
>>       }
>>       if (ret) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>>               dec_zone_page_state(page, NR_WRITEBACK);
>>               inc_zone_page_state(page, NR_WRITTEN);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>>       return ret;
>>  }
>>
>> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>>  {
>>       struct address_space *mapping = page_mapping(page);
>>       int ret;
>> +     bool locked;
>> +     unsigned long flags;
>>
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       if (mapping) {
>>               struct backing_dev_info *bdi = mapping->backing_dev_info;
>>               unsigned long flags;
>> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>>       }
>>       if (!ret)
>>               account_page_writeback(page);
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       return ret;
>>
>>  }
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Michal Hocko
> SUSE Labs



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
       [not found]         ` <CAFj3OHV-VCKJfe6bv4UMvv+uj4LELDXsieRZFJD06Yrdyy=XxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-08-04 10:08           ` Michal Hocko
  2013-08-22  9:46             ` Fwd: " Sha Zhengju
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2013-08-04 10:08 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Cgroups, KAMEZAWA Hiroyuki,
	Glauber Costa, Greg Thelen, Wu Fengguang, Andrew Morton,
	Sha Zhengju

On Sat 03-08-13 17:25:01, Sha Zhengju wrote:
> On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> > On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> >>
> >> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> >> rule still is:
> >>         mem_cgroup_begin_update_page_stat()
> >>         modify page WRITEBACK stat
> >>         mem_cgroup_update_page_stat()
> >>         mem_cgroup_end_update_page_stat()
> >>
> >> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> >> Lock order:
> >>       --> memcg->move_lock
> >>         --> mapping->tree_lock
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> >
> > Looks good to me. Maybe I would suggest moving this patch up the stack
> > so that it might get merged earlier as it is simpler than dirty pages
> > accounting. Unless you insist on having the full series merged at once.
> 
> I think the following three patches can be merged earlier:
>       1/8 memcg: remove MEMCG_NR_FILE_MAPPED
>       3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
>       5/8 memcg: add per cgroup writeback pages accounting
> 
> Do I need to resend them for you, or are they enough as they are?

This is a question for Andrew. I would go with them as they are.
 
> One more word: since dirty accounting is essential to future memcg
> dirty page throttling and is no longer an optional feature, I wonder
> whether we can merge the following two as well and leave the overhead
> optimization to a separate series.  :p

I wouldn't hurry it. We need numbers for serious testing to see the
overhead. It is still just a small step towards dirty throttling.

>       4/5 memcg: add per cgroup dirty pages accounting
>       8/8 memcg: Document cgroup dirty/writeback memory statistics
> 
> The 2/8 ceph one still needs more improvement; I'll separate it out in the next version.
> 
> >
> > Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> 
> Thank you.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
  2013-08-01 14:53   ` Michal Hocko
@ 2013-08-04 18:51   ` Greg Thelen
  1 sibling, 0 replies; 21+ messages in thread
From: Greg Thelen @ 2013-08-04 18:51 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-fsdevel, linux-mm, cgroups, mhocko, kamezawa.hiroyu,
	glommer, fengguang.wu, akpm, Sha Zhengju

On Thu, Aug 01 2013, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> rule still is:
>         mem_cgroup_begin_update_page_stat()
>         modify page WRITEBACK stat
>         mem_cgroup_update_page_stat()
>         mem_cgroup_end_update_page_stat()
>
> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> Lock order:
> 	--> memcg->move_lock
> 	  --> mapping->tree_lock
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |    5 +++++
>  mm/page-writeback.c        |   15 +++++++++++++++
>  3 files changed, 21 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f952be6..ccd35d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -43,6 +43,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
>  	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
>  	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8f3e514..6c18a6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -91,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss_huge",
>  	"mapped_file",
>  	"dirty",
> +	"writeback",
>  	"swap",
>  };
>  
> @@ -3812,6 +3813,10 @@ static int mem_cgroup_move_account(struct page *page,
>  		mem_cgroup_move_account_page_stat(from, to, nr_pages,
>  			MEM_CGROUP_STAT_FILE_DIRTY);
>  
> +	if (PageWriteback(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_WRITEBACK);
> +
>  	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a09f518..2fa6a52 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2008,11 +2008,17 @@ EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
>   * Helper function for set_page_writeback family.
> + *
> + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
> + * while calling this function.
> + * See test_set_page_writeback for example.
> + *
>   * NOTE: Unlike account_page_dirtied this does not rely on being atomic
>   * wrt interrupts.
>   */
>  void account_page_writeback(struct page *page)
>  {
> +	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  	inc_zone_page_state(page, NR_WRITEBACK);
>  }
>  EXPORT_SYMBOL(account_page_writeback);
> @@ -2243,7 +2249,10 @@ int test_clear_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long memcg_flags;
>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2264,9 +2273,11 @@ int test_clear_page_writeback(struct page *page)
>  		ret = TestClearPageWriteback(page);
>  	}
>  	if (ret) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITTEN);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
>  	return ret;
>  }
>  
> @@ -2274,7 +2285,10 @@ int test_set_page_writeback(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
>  	int ret;
> +	bool locked;
> +	unsigned long flags;

I recommend using memcg_flags here, which matches the pattern in
account_page_writeback, to avoid overlapping the nested 'flags' variable
below.

Otherwise,
Reviewed-by: Greg Thelen <gthelen@google.com>

>  
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (mapping) {
>  		struct backing_dev_info *bdi = mapping->backing_dev_info;
>  		unsigned long flags;
> @@ -2301,6 +2315,7 @@ int test_set_page_writeback(struct page *page)
>  	}
>  	if (!ret)
>  		account_page_writeback(page);
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	return ret;
>  
>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Fwd: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
  2013-08-04 10:08           ` Michal Hocko
@ 2013-08-22  9:46             ` Sha Zhengju
  2013-08-22  9:50               ` [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
                                 ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Greg Thelen, linux-mm, Cgroups,
	linux-fsdevel, Sha Zhengju

Hi Andrew,

After several rounds of review, parts of the memcg page accounting
patchset are ready (leaving memcg dirty page accounting under further
review):

  1/8 memcg: remove MEMCG_NR_FILE_MAPPED
  2/8 ceph: vfs __set_page_dirty_nobuffers interface instead of doing
it inside filesystem
  3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
  5/8 memcg: add per cgroup writeback pages accounting
  8/8 memcg: Document cgroup dirty/writeback memory statistics

But the 2/8 ceph one has been improved again and will be merged via the
ceph tree, so only the other 4 patches need to be added to the -mm tree.
I've moved the 5/8 writeback one up the stack and updated the 8/8 to
document only the writeback changes. Could you please merge them
earlier? I'll post these updated patches soon. :)

Thank you!


---------- Forwarded message ----------
From: Michal Hocko <mhocko@suse.cz>
Date: Sun, Aug 4, 2013 at 6:08 PM
Subject: Re: [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting
To: Sha Zhengju <handai.szj@gmail.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>, Cgroups
<cgroups@vger.kernel.org>, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com>, Glauber Costa <glommer@gmail.com>,
Greg Thelen <gthelen@google.com>, Wu Fengguang
<fengguang.wu@intel.com>, Andrew Morton <akpm@linux-foundation.org>,
Sha Zhengju <handai.szj@taobao.com>


On Sat 03-08-13 17:25:01, Sha Zhengju wrote:
> On Thu, Aug 1, 2013 at 10:53 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 01-08-13 19:54:11, Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> Similar to dirty page, we add per cgroup writeback pages accounting. The lock
> >> rule still is:
> >>         mem_cgroup_begin_update_page_stat()
> >>         modify page WRITEBACK stat
> >>         mem_cgroup_update_page_stat()
> >>         mem_cgroup_end_update_page_stat()
> >>
> >> There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
> >> Lock order:
> >>       --> memcg->move_lock
> >>         --> mapping->tree_lock
> >>
> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> >
> > Looks good to me. Maybe I would suggest moving this patch up the stack
> > so that it might get merged earlier as it is simpler than dirty pages
> > accounting. Unless you insist on having the full series merged at once.
>
> I think the following three patches can be merged earlier:
>       1/8 memcg: remove MEMCG_NR_FILE_MAPPED
>       3/8 memcg: check for proper lock held in mem_cgroup_update_page_stat
>       5/8 memcg: add per cgroup writeback pages accounting
>
> Do I need to resend them for you, or are they enough as they are?

This is a question for Andrew. I would go with them as they are.

> One more word: since dirty accounting is essential to future memcg
> dirty page throttling and is no longer an optional feature, I wonder
> whether we can merge the following two as well and leave the overhead
> optimization to a separate series.  :p

I wouldn't hurry it. We need numbers for serious testing to see the
overhead. It is still just a small step towards dirty throttling.

>       4/5 memcg: add per cgroup dirty pages accounting
>       8/8 memcg: Document cgroup dirty/writeback memory statistics
>
> The 2/8 ceph one still needs more improvement; I'll separate it out in the next version.
>
> >
> > Acked-by: Michal Hocko <mhocko@suse.cz>
>
> Thank you.
[...]
--
Michal Hocko
SUSE Labs


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED
  2013-08-22  9:46             ` Fwd: " Sha Zhengju
@ 2013-08-22  9:50               ` Sha Zhengju
  2013-08-22  9:52               ` [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:50 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

While accounting memcg page stats, it's not worth using MEMCG_NR_FILE_MAPPED
as an extra layer of indirection, given the complexity and presumed
performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |   27 +++++++++++++++++++--------
 mm/memcontrol.c            |   25 +------------------------
 mm/rmap.c                  |    4 ++--
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8a9ed4d..630bfa1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -30,9 +30,20 @@ struct page;
 struct mm_struct;
 struct kmem_cache;
 
-/* Stats that can be updated by kernel. */
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+/*
+ * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c;
+ * these two lists must be kept in sync with each other.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
+	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
+	MEM_CGROUP_STAT_NSTATS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -233,17 +244,17 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx,
+				 enum mem_cgroup_stat_index idx,
 				 int val);
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, 1);
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
@@ -448,12 +459,12 @@ static inline bool mem_cgroup_oom_synchronize(void)
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
-					    enum mem_cgroup_page_stat_item idx)
+					    enum mem_cgroup_stat_index idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b73988a..24d6d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -84,21 +84,6 @@ static int really_do_swap_account __initdata = 0;
 #endif
 
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
-	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
-	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -2255,7 +2240,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
-				 enum mem_cgroup_page_stat_item idx, int val)
+				 enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -2268,14 +2253,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	switch (idx) {
-	case MEMCG_NR_FILE_MAPPED:
-		idx = MEM_CGROUP_STAT_FILE_MAPPED;
-		break;
-	default:
-		BUG();
-	}
-
 	this_cpu_add(memcg->stat->count[idx], val);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 07afd48..373da2a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,7 +1114,7 @@ void page_add_file_rmap(struct page *page)
 	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
@@ -1158,7 +1158,7 @@ void page_remove_rmap(struct page *page)
 				hpage_nr_pages(page));
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	}
 	if (unlikely(PageMlocked(page)))
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat
  2013-08-22  9:46             ` Fwd: " Sha Zhengju
  2013-08-22  9:50               ` [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
@ 2013-08-22  9:52               ` Sha Zhengju
  2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
  2013-08-22  9:53               ` [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
  3 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:52 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to take the proper locks; however, the
latter does not check that the proper locking is actually held, which
would be hard to do in general. As suggested by Michal Hocko, we can at
least test for rcu_read_lock_held(), because RCU is held if
!mem_cgroup_disabled().

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 mm/memcontrol.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 24d6d02..0a50871 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2249,6 +2249,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (mem_cgroup_disabled())
 		return;
 
+	VM_BUG_ON(!rcu_read_lock_held());
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22  9:46             ` Fwd: " Sha Zhengju
  2013-08-22  9:50               ` [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
  2013-08-22  9:52               ` [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
@ 2013-08-22  9:53               ` Sha Zhengju
  2013-08-22 22:40                 ` Andrew Morton
  2013-08-22  9:53               ` [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
  3 siblings, 1 reply; 21+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:53 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count writeback pages; later, dirty pages will
also be accounted.

After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
use 'struct page' flags to test page state instead of per-page_cgroup flags. But
memcg can move a page from one cgroup to another, which may race with page stat
accounting. To avoid that race we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (the writeback pages accounting) to be protected by
mem_cgroup_{begin/end}_update_page_stat(). This is a complete no-op for
!CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an RCU read
lock in most cases (no task is moving), and spin_lock_irqsave on top of that in
the slow path.

There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
	--> memcg->move_lock
	  --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |   30 +++++++++++++++++++++++-------
 mm/page-writeback.c        |   15 +++++++++++++++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 630bfa1..1878644 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0a50871..869bfaf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -89,6 +89,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"writeback",
 	"swap",
 };
 
@@ -3675,6 +3676,20 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3720,13 +3735,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageWriteback(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_WRITEBACK);
+
 	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ddc56fc..b6ecbc6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1999,11 +1999,17 @@ EXPORT_SYMBOL(account_page_dirtied);
 
 /*
  * Helper function for set_page_writeback family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while calling this function.
+ * See test_set_page_writeback for example.
+ *
  * NOTE: Unlike account_page_dirtied this does not rely on being atomic
  * wrt interrupts.
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2220,7 +2226,10 @@ int test_clear_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2241,9 +2250,11 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 }
 
@@ -2251,7 +2262,10 @@ int test_set_page_writeback(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret;
+	bool locked;
+	unsigned long memcg_flags;
 
+	mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags);
 	if (mapping) {
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
@@ -2278,6 +2292,7 @@ int test_set_page_writeback(struct page *page)
 	}
 	if (!ret)
 		account_page_writeback(page);
+	mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags);
 	return ret;
 
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics
  2013-08-22  9:46             ` Fwd: " Sha Zhengju
                                 ` (2 preceding siblings ...)
  2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
@ 2013-08-22  9:53               ` Sha Zhengju
  3 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-22  9:53 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 Documentation/cgroups/memory.txt |    1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 327acec..c9ea502 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -490,6 +490,7 @@ pgpgin		- # of charging events to the memory cgroup. The charging
 pgpgout		- # of uncharging events to the memory cgroup. The uncharging
 		event happens each time a page is unaccounted from the cgroup.
 swap		- # of bytes of swap usage
+writeback      - # of bytes of file/anon cache that are queued for syncing to disk.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
@ 2013-08-22 22:40                 ` Andrew Morton
  2013-08-23 16:11                   ` Sha Zhengju
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2013-08-22 22:40 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: mhocko, kamezawa.hiroyu, gthelen, linux-mm, cgroups,
	linux-fsdevel, Sha Zhengju

On Thu, 22 Aug 2013 17:53:10 +0800 Sha Zhengju <handai.szj@gmail.com> wrote:

> This patch is to add memcg routines to count writeback pages

Well OK, but why?  What use is the feature?  In what ways are people
suffering due to its absence?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] memcg: add per cgroup writeback pages accounting
  2013-08-22 22:40                 ` Andrew Morton
@ 2013-08-23 16:11                   ` Sha Zhengju
  0 siblings, 0 replies; 21+ messages in thread
From: Sha Zhengju @ 2013-08-23 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, KAMEZAWA Hiroyuki, Greg Thelen, linux-mm, Cgroups,
	linux-fsdevel, Sha Zhengju

On Fri, Aug 23, 2013 at 6:40 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu, 22 Aug 2013 17:53:10 +0800 Sha Zhengju <handai.szj@gmail.com> wrote:
>
>> This patch is to add memcg routines to count writeback pages
>
> Well OK, but why?  What use is the feature?  In what ways are people
> suffering due to its absence?

My apologies for not explaining it clearly.

It's a subset of memcg dirty page accounting (covering dirty, writeback,
and nfs_unstable pages in a broad sense), which provides a sounder
picture of memcg behavior. It also makes it straightforward to add new
features like memcg dirty page throttling and even memcg-aware
flushing.
However, the dirty-page part is more complicated and performance
sensitive, so I need more effort to improve it and am letting the
writeback patch go first.  Afterwards I'll focus on dirty pages only.

-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-08-23 16:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1375357402-9811-1-git-send-email-handai.szj@taobao.com>
2013-08-01 11:51 ` [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem Sha Zhengju
2013-08-01 15:19   ` Yan, Zheng
2013-08-01 18:27     ` Sage Weil
2013-08-02 10:04       ` Sha Zhengju
     [not found]         ` <CAFj3OHVXvtr5BDMrGatHZi7M9y+dh1ZKRMQZGjZmNBcg3pNQtw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-02 20:30           ` Sage Weil
2013-08-03  8:58             ` Sha Zhengju
2013-08-02  9:04     ` Sha Zhengju
2013-08-02 13:11       ` Yan, Zheng
2013-08-01 11:53 ` [PATCH V5 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
2013-08-01 11:54 ` [PATCH V5 5/8] memcg: add per cgroup writeback " Sha Zhengju
2013-08-01 14:53   ` Michal Hocko
     [not found]     ` <20130801145302.GJ5198-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-08-03  9:25       ` Sha Zhengju
     [not found]         ` <CAFj3OHV-VCKJfe6bv4UMvv+uj4LELDXsieRZFJD06Yrdyy=XxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-04 10:08           ` Michal Hocko
2013-08-22  9:46             ` Fwd: " Sha Zhengju
2013-08-22  9:50               ` [PATCH 1/4] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2013-08-22  9:52               ` [PATCH 2/4] memcg: check for proper lock held in mem_cgroup_update_page_stat Sha Zhengju
2013-08-22  9:53               ` [PATCH 3/4] memcg: add per cgroup writeback pages accounting Sha Zhengju
2013-08-22 22:40                 ` Andrew Morton
2013-08-23 16:11                   ` Sha Zhengju
2013-08-22  9:53               ` [PATCH 4/4] memcg: Document cgroup dirty/writeback memory statistics Sha Zhengju
2013-08-04 18:51   ` [PATCH V5 5/8] memcg: add per cgroup writeback pages accounting Greg Thelen
