From: Michal Hocko <mhocko@suse.cz>
To: Sha Zhengju <handai.szj@gmail.com>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
	kamezawa.hiroyu@jp.fujitsu.com, gthelen@google.com,
	yinghan@google.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, Sha Zhengju <handai.szj@taobao.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
Date: Wed, 4 Jul 2012 18:11:44 +0200	[thread overview]
Message-ID: <20120704161144.GL29842@tiehlicka.suse.cz> (raw)
In-Reply-To: <1340881486-5770-1-git-send-email-handai.szj@taobao.com>

I guess you should CC vfs people

On Thu 28-06-12 19:04:46, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> provides some information to users while a group's direct reclaim is working.
> 
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
> we can use a 'struct page' flag to test page state instead of a per-page_cgroup
> flag. But memcg can also move a page from one cgroup to another, so there may
> be a race between "move" and "page stat accounting". To avoid that race, a
> bigger lock is used:
> 
>          mem_cgroup_begin_update_page_stat()
>          modify page information	-->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
> 
> It requires that (a) and (b) (the dirty page accounting) stay close together.
> 
> The previous two preparation patches reworked the vfs set-page-dirty routines,
> so the interfaces are now more explicit:
> 	incrementing (2):
> 		__set_page_dirty
> 		__set_page_dirty_nobuffers
> 	decrementing (2):
> 		clear_page_dirty_for_io
> 		cancel_dirty_page
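
Concretely, a stat-update site following this protocol ends up looking
roughly like this (a minimal sketch modeled on the __set_page_dirty()
hunk below, not verbatim code from the patch):

	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page)) {		/* (a) modify page state */
		/* (b) dirty page accounting, kept inside the same section */
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
		/* ... zone and bdi accounting ... */
	}
	mem_cgroup_end_update_page_stat(page, &locked, &flags);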

The patch seems correct at first glance (I have to look closer but I am
in a rush at the moment and will be back next week).

I was just thinking that memcg is enabled by most distributions these
days but not that many people actually use it. So it would probably be
good to think about how to reduce the overhead when !mem_cgroup_disabled()
but no cgroups have been created.

Glauber did something similar for the kmem accounting.
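
Roughly, that approach gates the accounting fast path behind a jump
label which is only enabled once a non-root cgroup exists. A
hypothetical sketch (the names here are illustrative, not from any
posted patch):

	#include <linux/jump_label.h>

	/* Flipped when the first non-root memcg is created. */
	static struct static_key memcg_in_use_key = STATIC_KEY_INIT_FALSE;

	static void memcg_mark_in_use(void)
	{
		static_key_slow_inc(&memcg_in_use_key);
	}

	static inline void memcg_inc_page_stat_fast(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		/*
		 * static_key_false() compiles to a patched nop while the
		 * key is disabled, so the common "memcg compiled in but
		 * unused" case pays almost nothing here.
		 */
		if (static_key_false(&memcg_in_use_key))
			mem_cgroup_inc_page_stat(page, idx);
	}

With a similar guard around the begin/end calls, the whole accounting
path would collapse to a couple of nops until somebody actually creates
a cgroup.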

> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                |   17 ++++++++++++++---
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |    5 +++++
>  mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>  mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 71 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	return 1;
> +	ret = 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		/*
> +		 * The page state is not changed here, so there is no need to
> +		 * take the lock via mem_cgroup_{begin,end}_update_page_stat.
> +		 */
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_DIRTIED);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
>  		struct address_space *mapping2;
>  
> -		if (!mapping)
> -			return 1;
> +		if (!mapping) {
> +			ret = 1;
> +			goto out;
> +		}
>  
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			/* !PageAnon && !swapper_space */
>  			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  		}
> -		return 1;
> +		ret = 1;
>  	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  int clear_page_dirty_for_io(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * the desired exclusion. See mm/memory.c:do_wp_page()
>  		 * for more comments.
>  		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>  		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>  	}
>  	return TestClearPageDirty(page);
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>   */
>  void cancel_dirty_page(struct page *page, unsigned int account_size)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  				task_io_account_cancelled_write(account_size);
>  		}
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  EXPORT_SYMBOL(cancel_dirty_page);
>  
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

Thread overview:

2012-06-28 10:54 [PATCH 0/7] Per-cgroup page stat accounting Sha Zhengju
2012-06-28 10:57 ` [PATCH 1/7] memcg: update cgroup memory document Sha Zhengju
2012-07-02  7:00   ` Kamezawa Hiroyuki
2012-07-04 12:47   ` Michal Hocko
2012-07-07 13:45   ` Fengguang Wu
2012-06-28 10:58 ` [PATCH 2/7] memcg: remove MEMCG_NR_FILE_MAPPED Sha Zhengju
2012-07-02 10:44   ` Kamezawa Hiroyuki
2012-07-04 12:56   ` Michal Hocko
2012-07-04 12:58     ` Michal Hocko
2012-07-07 13:48   ` Fengguang Wu
2012-07-09 21:01   ` Greg Thelen
2012-07-11  8:00     ` Sha Zhengju
2012-06-28 11:01 ` [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-07-02 11:14   ` Kamezawa Hiroyuki
2012-07-07 14:42     ` Fengguang Wu
2012-07-04 14:23   ` Michal Hocko
2012-06-28 11:03 ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
2012-06-29  5:21   ` Sage Weil
2012-07-02  8:10     ` Sha Zhengju
2012-07-02 14:49       ` Sage Weil
2012-07-04  8:11         ` Sha Zhengju
2012-07-05 15:20           ` Sage Weil
2012-07-05 15:40             ` Sha Zhengju
2012-07-04 14:27   ` Michal Hocko
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
2012-07-03  5:57   ` Kamezawa Hiroyuki
2012-07-08 14:45     ` Fengguang Wu
2012-07-04 16:11   ` Michal Hocko [this message]
2012-07-09 21:02   ` Greg Thelen
2012-07-11  9:32     ` Sha Zhengju
2012-07-19  6:33       ` Kamezawa Hiroyuki
2012-06-28 11:05 ` [PATCH 6/7] memcg: add per cgroup writeback " Sha Zhengju
2012-07-03  6:31   ` Kamezawa Hiroyuki
2012-07-04  8:24     ` Sha Zhengju
2012-07-08 14:44     ` Fengguang Wu
2012-07-08 23:01       ` Johannes Weiner
2012-07-09  1:37         ` Fengguang Wu
2012-07-04 16:15   ` Michal Hocko
2012-06-28 11:06 ` Sha Zhengju
2012-07-08 14:53   ` Fengguang Wu
2012-07-09  3:36     ` Sha Zhengju
2012-07-09  4:14       ` Fengguang Wu
2012-07-09  4:18         ` Kamezawa Hiroyuki
2012-07-09  5:22           ` Sha Zhengju
2012-07-09  5:28             ` Fengguang Wu
2012-07-09  5:19         ` Sha Zhengju
2012-07-09  5:25           ` Fengguang Wu
2012-07-09 21:02   ` Greg Thelen
2012-06-28 11:06 ` [PATCH 7/7] memcg: print more detailed info while memcg oom happening Sha Zhengju
2012-07-04  8:25   ` Sha Zhengju
2012-07-04  8:29   ` Kamezawa Hiroyuki
2012-07-04 11:20     ` Sha Zhengju
2012-06-29  8:23 ` [PATCH 0/7] Per-cgroup page stat accounting Kamezawa Hiroyuki
2012-07-02  7:51   ` Sha Zhengju