All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH -mmotm 0/3] memcg: per cgroup dirty limit (v3)
@ 2010-03-01 21:23 ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.

Changelog (v2 -> v3)
~~~~~~~~~~~~~~~~~~~~~~
 * properly handle the swapless case when reading dirtyable pages statistic
 * combine similar functions + code cleanup based on the received feedbacks
 * updated documentation in Documentation/cgroups/memory.txt

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 0/3] memcg: per cgroup dirty limit (v3)
@ 2010-03-01 21:23 ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.

Changelog (v2 -> v3)
~~~~~~~~~~~~~~~~~~~~~~
 * properly handle the swapless case when reading dirtyable pages statistic
 * combine similar functions + code cleanup based on the received feedbacks
 * updated documentation in Documentation/cgroups/memory.txt

-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 1/3] memcg: dirty memory documentation
       [not found] ` <1267478620-5276-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-01 21:23   ` Andrea Righi
  2010-03-01 21:23   ` [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
  2010-03-01 21:23   ` [PATCH -mmotm 3/3] memcg: dirty pages instrumentation Andrea Righi
  2 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index aad7d05..878afa7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -308,6 +308,11 @@ cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and  swap cache memory on active
 		  lru list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -343,6 +348,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process which is generating disk writes
+    inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 1/3] memcg: dirty memory documentation
  2010-03-01 21:23 ` Andrea Righi
@ 2010-03-01 21:23   ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index aad7d05..878afa7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -308,6 +308,11 @@ cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and  swap cache memory on active
 		  lru list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -343,6 +348,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process which is generating disk writes
+    inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 1/3] memcg: dirty memory documentation
@ 2010-03-01 21:23   ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index aad7d05..878afa7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -308,6 +308,11 @@ cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and  swap cache memory on active
 		  lru list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -343,6 +348,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process which is generating disk writes
+    inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found] ` <1267478620-5276-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-01 21:23   ` [PATCH -mmotm 1/3] memcg: dirty memory documentation Andrea Righi
@ 2010-03-01 21:23   ` Andrea Righi
  2010-03-01 21:23   ` [PATCH -mmotm 3/3] memcg: dirty pages instrumentation Andrea Righi
  2 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 include/linux/memcontrol.h |   77 ++++++++++-
 mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 384 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc88b2e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,50 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+					used by soft limit implementation */
+	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+					used by threshold implementation */
+	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+						temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern long mem_cgroup_dirty_ratio(void);
+extern unsigned long mem_cgroup_dirty_bytes(void);
+extern long mem_cgroup_dirty_background_ratio(void);
+extern unsigned long mem_cgroup_dirty_background_bytes(void);
+
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
@@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline long mem_cgroup_dirty_ratio(void)
+{
+	return vm_dirty_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+	return vm_dirty_bytes;
+}
+
+static inline long mem_cgroup_dirty_background_ratio(void)
+{
+	return dirty_background_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	return dirty_background_bytes;
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOMEM;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a443c30..e74cf66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 #define THRESHOLDS_EVENTS_THRESH (100)
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
-					used by soft limit implementation */
-	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
-					used by threshold implementation */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
 
+/* Per cgroup page statistics */
+struct mem_cgroup_page_stat {
+	enum mem_cgroup_page_stat_item item;
+	s64 value;
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
 static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 
+enum mem_cgroup_dirty_param {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+
+	MEM_CGROUP_DIRTY_NPARAMS,
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -205,6 +199,9 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	/* control memory cgroup dirty pages */
+	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
+
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
@@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static unsigned long get_dirty_param(struct mem_cgroup *memcg,
+			enum mem_cgroup_dirty_param idx)
+{
+	unsigned long ret;
+
+	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
+	spin_lock(&memcg->reclaim_param_lock);
+	ret = memcg->dirty_param[idx];
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return ret;
+}
+
+long mem_cgroup_dirty_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = vm_dirty_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	/*
+	 * It's possible that "current" may be moved to other cgroup while we
+	 * access cgroup. But precise check is meaningless because the task can
+	 * be moved after our access and writeback tends to take long time.
+	 * At least, "memcg" will not be freed under rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = vm_dirty_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+long mem_cgroup_dirty_background_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = dirty_background_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = dirty_background_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	return do_swap_account ?
+			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
+			nr_swap_pages > 0;
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+				enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory in pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+				MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		ret = 0;
+		WARN_ON_ONCE(1);
+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return -ENOMEM;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaulate page statistics against all cgroup
+		 * under hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -ENOMEM;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update memory cgroup statistics.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 
+	if (mem_cgroup_disabled())
+		return;
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
@@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
+	__this_cpu_add(mem->stat->count[idx], val);
 
 done:
 	unlock_page_cgroup(pc);
@@ -3033,6 +3191,10 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_WRITEBACK_TEMP,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3055,6 +3217,10 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"filedirty", "dirty_pages"},
+	{"writeback", "writeback_pages"},
+	{"writeback_tmp", "writeback_temp_pages"},
+	{"nfs", "nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3467,6 +3641,50 @@ unlock:
 	return ret;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	return get_dirty_param(memcg, type);
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
+		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
+		return -EINVAL;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
+		break;
+	}
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
 	return 0;
 }
 
+/*
+ * NOTE: called only with &src->reclaim_param_lock held from
+ * mem_cgroup_create().
+ */
+static inline void
+copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
+{
+	int i;
+
+	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
+		dst->dirty_param[i] = src->dirty_param[i];
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+
+		spin_lock(&parent->reclaim_param_lock);
+		copy_dirty_params(mem, parent);
+		spin_unlock(&parent->reclaim_param_lock);
+	} else {
+		/*
+		 * XXX: should we need a lock here? we could switch from
+		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
+		 * reading them atomically. The same for dirty_background_ratio
+		 * and dirty_background_bytes.
+		 *
+		 * For now, try to read them speculatively and retry if a
+		 * "conflict" is detected.
+		 */
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
+						vm_dirty_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
+						vm_dirty_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
+			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
+						dirty_background_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
+						dirty_background_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-01 21:23 ` Andrea Righi
@ 2010-03-01 21:23   ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   77 ++++++++++-
 mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 384 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc88b2e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,50 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+					used by soft limit implementation */
+	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+					used by threshold implementation */
+	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+						temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern long mem_cgroup_dirty_ratio(void);
+extern unsigned long mem_cgroup_dirty_bytes(void);
+extern long mem_cgroup_dirty_background_ratio(void);
+extern unsigned long mem_cgroup_dirty_background_bytes(void);
+
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
@@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline long mem_cgroup_dirty_ratio(void)
+{
+	return vm_dirty_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+	return vm_dirty_bytes;
+}
+
+static inline long mem_cgroup_dirty_background_ratio(void)
+{
+	return dirty_background_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	return dirty_background_bytes;
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOMEM;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a443c30..e74cf66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 #define THRESHOLDS_EVENTS_THRESH (100)
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
-					used by soft limit implementation */
-	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
-					used by threshold implementation */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
 
+/* Per cgroup page statistics */
+struct mem_cgroup_page_stat {
+	enum mem_cgroup_page_stat_item item;
+	s64 value;
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
 static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 
+enum mem_cgroup_dirty_param {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+
+	MEM_CGROUP_DIRTY_NPARAMS,
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -205,6 +199,9 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	/* control memory cgroup dirty pages */
+	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
+
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
@@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static unsigned long get_dirty_param(struct mem_cgroup *memcg,
+			enum mem_cgroup_dirty_param idx)
+{
+	unsigned long ret;
+
+	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
+	spin_lock(&memcg->reclaim_param_lock);
+	ret = memcg->dirty_param[idx];
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return ret;
+}
+
+long mem_cgroup_dirty_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = vm_dirty_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	/*
+	 * It's possible that "current" may be moved to other cgroup while we
+	 * access cgroup. But precise check is meaningless because the task can
+	 * be moved after our access and writeback tends to take long time.
+	 * At least, "memcg" will not be freed under rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = vm_dirty_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+long mem_cgroup_dirty_background_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = dirty_background_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = dirty_background_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	return do_swap_account ?
+			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
+			nr_swap_pages > 0;
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+				enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory in pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+				MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		ret = 0;
+		WARN_ON_ONCE(1);
+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return -ENOMEM;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaulate page statistics against all cgroup
+		 * under hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -ENOMEM;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update memory cgroup statistics.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 
+	if (mem_cgroup_disabled())
+		return;
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
@@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
+	__this_cpu_add(mem->stat->count[idx], val);
 
 done:
 	unlock_page_cgroup(pc);
@@ -3033,6 +3191,10 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_WRITEBACK_TEMP,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3055,6 +3217,10 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"filedirty", "dirty_pages"},
+	{"writeback", "writeback_pages"},
+	{"writeback_tmp", "writeback_temp_pages"},
+	{"nfs", "nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3467,6 +3641,50 @@ unlock:
 	return ret;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	return get_dirty_param(memcg, type);
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
+		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
+		return -EINVAL;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
+		break;
+	}
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
 	return 0;
 }
 
+/*
+ * NOTE: called only with &src->reclaim_param_lock held from
+ * mem_cgroup_create().
+ */
+static inline void
+copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
+{
+	int i;
+
+	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
+		dst->dirty_param[i] = src->dirty_param[i];
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+
+		spin_lock(&parent->reclaim_param_lock);
+		copy_dirty_params(mem, parent);
+		spin_unlock(&parent->reclaim_param_lock);
+	} else {
+		/*
+		 * XXX: should we need a lock here? we could switch from
+		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
+		 * reading them atomically. The same for dirty_background_ratio
+		 * and dirty_background_bytes.
+		 *
+		 * For now, try to read them speculatively and retry if a
+		 * "conflict" is detected.
+		 */
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
+						vm_dirty_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
+						vm_dirty_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
+			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
+						dirty_background_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
+						dirty_background_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-01 21:23   ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   77 ++++++++++-
 mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 384 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc88b2e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,50 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+					used by soft limit implementation */
+	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+					used by threshold implementation */
+	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+						temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern long mem_cgroup_dirty_ratio(void);
+extern unsigned long mem_cgroup_dirty_bytes(void);
+extern long mem_cgroup_dirty_background_ratio(void);
+extern unsigned long mem_cgroup_dirty_background_bytes(void);
+
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
@@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline long mem_cgroup_dirty_ratio(void)
+{
+	return vm_dirty_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+	return vm_dirty_bytes;
+}
+
+static inline long mem_cgroup_dirty_background_ratio(void)
+{
+	return dirty_background_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	return dirty_background_bytes;
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOMEM;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a443c30..e74cf66 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 #define THRESHOLDS_EVENTS_THRESH (100)
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
-					used by soft limit implementation */
-	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
-					used by threshold implementation */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
 
+/* Per cgroup page statistics */
+struct mem_cgroup_page_stat {
+	enum mem_cgroup_page_stat_item item;
+	s64 value;
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
 static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 
+enum mem_cgroup_dirty_param {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+
+	MEM_CGROUP_DIRTY_NPARAMS,
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -205,6 +199,9 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	/* control memory cgroup dirty pages */
+	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
+
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
@@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static unsigned long get_dirty_param(struct mem_cgroup *memcg,
+			enum mem_cgroup_dirty_param idx)
+{
+	unsigned long ret;
+
+	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
+	spin_lock(&memcg->reclaim_param_lock);
+	ret = memcg->dirty_param[idx];
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return ret;
+}
+
+long mem_cgroup_dirty_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = vm_dirty_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	/*
+	 * It's possible that "current" may be moved to other cgroup while we
+	 * access cgroup. But precise check is meaningless because the task can
+	 * be moved after our access and writeback tends to take long time.
+	 * At least, "memcg" will not be freed under rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = vm_dirty_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+long mem_cgroup_dirty_background_ratio(void)
+{
+	struct mem_cgroup *memcg;
+	long ret = dirty_background_ratio;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = dirty_background_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	return do_swap_account ?
+			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
+			nr_swap_pages > 0;
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+				enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory in pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+				MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		ret = 0;
+		WARN_ON_ONCE(1);
+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return -ENOMEM;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaulate page statistics against all cgroup
+		 * under hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -ENOMEM;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update memory cgroup statistics.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 
+	if (mem_cgroup_disabled())
+		return;
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
@@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
+	__this_cpu_add(mem->stat->count[idx], val);
 
 done:
 	unlock_page_cgroup(pc);
@@ -3033,6 +3191,10 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_WRITEBACK_TEMP,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3055,6 +3217,10 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"filedirty", "dirty_pages"},
+	{"writeback", "writeback_pages"},
+	{"writeback_tmp", "writeback_temp_pages"},
+	{"nfs", "nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3467,6 +3641,50 @@ unlock:
 	return ret;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	return get_dirty_param(memcg, type);
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
+		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
+		return -EINVAL;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
+		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
+		break;
+	}
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
 	return 0;
 }
 
+/*
+ * NOTE: called only with &src->reclaim_param_lock held from
+ * mem_cgroup_create().
+ */
+static inline void
+copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
+{
+	int i;
+
+	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
+		dst->dirty_param[i] = src->dirty_param[i];
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+
+		spin_lock(&parent->reclaim_param_lock);
+		copy_dirty_params(mem, parent);
+		spin_unlock(&parent->reclaim_param_lock);
+	} else {
+		/*
+		 * XXX: should we need a lock here? we could switch from
+		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
+		 * reading them atomically. The same for dirty_background_ratio
+		 * and dirty_background_bytes.
+		 *
+		 * For now, try to read them speculatively and retry if a
+		 * "conflict" is detected.
+		 */
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
+						vm_dirty_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
+						vm_dirty_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
+			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
+		do {
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
+						dirty_background_ratio;
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
+						dirty_background_bytes;
+		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
+			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found] ` <1267478620-5276-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-01 21:23   ` [PATCH -mmotm 1/3] memcg: dirty memory documentation Andrea Righi
  2010-03-01 21:23   ` [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
@ 2010-03-01 21:23   ` Andrea Righi
  2 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   10 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..aef6d13 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(clone_page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..d83f41c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty_total = dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (mem_cgroup_dirty_ratio() *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	if (memcg_memory < 0)
+		return memory + 1;
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
@@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
 	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, dirty_bytes, dirty_background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = mem_cgroup_dirty_ratio();
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	dirty_background = mem_cgroup_dirty_background_bytes();
+	if (dirty_background)
+		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (mem_cgroup_dirty_background_ratio() *
+					available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (nr_reclaimable < 0)
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		if (dirty < 0)
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 4d2fb93..8d74335 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23 ` Andrea Righi
@ 2010-03-01 21:23   ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   10 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..aef6d13 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(clone_page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..d83f41c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty_total = dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (mem_cgroup_dirty_ratio() *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	if (memcg_memory < 0)
+		return memory + 1;
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
@@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
 	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, dirty_bytes, dirty_background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = mem_cgroup_dirty_ratio();
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	dirty_background = mem_cgroup_dirty_background_bytes();
+	if (dirty_background)
+		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (mem_cgroup_dirty_background_ratio() *
+					available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (nr_reclaimable < 0)
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		if (dirty < 0)
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 4d2fb93..8d74335 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 140+ messages in thread

* [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-01 21:23   ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 21:23 UTC (permalink / raw)
  To: Balbir Singh, KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Andrea Righi

Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   10 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..aef6d13 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(clone_page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..d83f41c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty_total = dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (mem_cgroup_dirty_ratio() *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	if (memcg_memory < 0)
+		return memory + 1;
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
@@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
 	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, dirty_bytes, dirty_background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = mem_cgroup_dirty_ratio();
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	dirty_background = mem_cgroup_dirty_background_bytes();
+	if (dirty_background)
+		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (mem_cgroup_dirty_background_ratio() *
+					available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (nr_reclaimable < 0)
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		if (dirty < 0)
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 4d2fb93..8d74335 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-01 22:02     ` Vivek Goyal
  2010-03-02  0:23     ` KAMEZAWA Hiroyuki
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-01 22:02 UTC (permalink / raw)
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon, Mar 01, 2010 at 10:23:40PM +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

dirty is unsigned long. As mentioned last time, above will never be true?
In general these patches look ok to me. I will do some testing with these.

Vivek

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-01 22:02     ` Vivek Goyal
  -1 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-01 22:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Andrea Righi

On Mon, Mar 01, 2010 at 10:23:40PM +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

dirty is unsigned long. As mentioned last time, above will never be true?
In general these patches look ok to me. I will do some testing with these.

Vivek

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-01 22:02     ` Vivek Goyal
  0 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-01 22:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Mon, Mar 01, 2010 at 10:23:40PM +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

dirty is unsigned long. As mentioned last time, above will never be true?
In general these patches look ok to me. I will do some testing with these.

Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]     ` <20100301220208.GH3109-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-01 22:18       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 22:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> dirty is unsigned long. As mentioned last time, above will never be true?
> In general these patches look ok to me. I will do some testing with these.

Re-introduced the same bug. My bad. :(

The value returned from mem_cgroup_page_stat() can be negative, i.e.
when memory cgroup is disabled. We could simply use a long for dirty,
the unit is in # of pages so s64 should be enough. Or cast dirty to long
only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 
 		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-		if (dirty < 0)
+		if ((long)dirty < 0)
 			dirty = global_page_state(NR_UNSTABLE_NFS) +
 				global_page_state(NR_WRITEBACK);
 		if (dirty <= dirty_thresh)

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 22:02     ` Vivek Goyal
@ 2010-03-01 22:18       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 22:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> dirty is unsigned long. As mentioned last time, above will never be true?
> In general these patches look ok to me. I will do some testing with these.

Re-introduced the same bug. My bad. :(

The value returned from mem_cgroup_page_stat() can be negative, i.e.
when memory cgroup is disabled. We could simply use a long for dirty,
the unit is in # of pages so s64 should be enough. Or cast dirty to long
only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 
 		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-		if (dirty < 0)
+		if ((long)dirty < 0)
 			dirty = global_page_state(NR_UNSTABLE_NFS) +
 				global_page_state(NR_WRITEBACK);
 		if (dirty <= dirty_thresh)

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-01 22:18       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-01 22:18 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> dirty is unsigned long. As mentioned last time, above will never be true?
> In general these patches look ok to me. I will do some testing with these.

Re-introduced the same bug. My bad. :(

The value returned from mem_cgroup_page_stat() can be negative, i.e.
when memory cgroup is disabled. We could simply use a long for dirty,
the unit is in # of pages so s64 should be enough. Or cast dirty to long
only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 
 		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-		if (dirty < 0)
+		if ((long)dirty < 0)
 			dirty = global_page_state(NR_UNSTABLE_NFS) +
 				global_page_state(NR_WRITEBACK);
 		if (dirty <= dirty_thresh)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267478620-5276-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-02  0:20     ` KAMEZAWA Hiroyuki
  2010-03-02 10:04     ` Kirill A. Shutemov
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:20 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon,  1 Mar 2010 22:23:39 +0100
Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>

seems nice. You can add my ack.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

But maybe you have to wait for a while. (some amounts of series of
patches are posted...) please be patient.

BTW, No TODO anymore ?


Regards,
-Kame

> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>  
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>  
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>  
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>  
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>  
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
>  
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
>  
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.
> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02  0:20     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:20 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon,  1 Mar 2010 22:23:39 +0100
Andrea Righi <arighi@develer.com> wrote:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

seems nice. You can add my ack.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

But maybe you have to wait for a while. (some amounts of series of
patches are posted...) please be patient.

BTW, No TODO anymore ?


Regards,
-Kame

> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>  
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>  
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>  
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>  
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>  
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
>  
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
>  
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.
> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02  0:20     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:20 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon,  1 Mar 2010 22:23:39 +0100
Andrea Righi <arighi@develer.com> wrote:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

seems nice. You can add my ack.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

But maybe you have to wait for a while. (some amounts of series of
patches are posted...) please be patient.

BTW, No TODO anymore ?


Regards,
-Kame

> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>  
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>  
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>  
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>  
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>  
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
>  
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
>  
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.
> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-01 22:02     ` Vivek Goyal
@ 2010-03-02  0:23     ` KAMEZAWA Hiroyuki
  2010-03-02 10:11     ` Kirill A. Shutemov
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon,  1 Mar 2010 22:23:40 +0100
Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>

Seems nice.

Hmm. the last problem is moving account between memcg.

Right ?

Thanks,
-Kame


> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02  0:23     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon,  1 Mar 2010 22:23:40 +0100
Andrea Righi <arighi@develer.com> wrote:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

Seems nice.

Hmm. the last problem is moving account between memcg.

Right ?

Thanks,
-Kame


> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02  0:23     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  0:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon,  1 Mar 2010 22:23:40 +0100
Andrea Righi <arighi@develer.com> wrote:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

Seems nice.

Hmm. the last problem is moving account between memcg.

Right ?

Thanks,
-Kame


> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]     ` <20100302092309.bff454d7.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-03-02  8:01       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon,  1 Mar 2010 22:23:40 +0100
> Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> 
> Seems nice.
> 
> Hmm. the last problem is moving account between memcg.
> 
> Right ?

Correct. This was actually the last item of the TODO list. Anyway, I'm
still considering if it's correct to move dirty pages when a task is
migrated from a cgroup to another. Currently, dirty pages just remain in
the original cgroup and are flushed depending on the original cgroup
settings. That is not totally wrong... at least moving the dirty pages
between memcgs should be optional (move_charge_at_immigrate?).

Thanks for your ack and the detailed review!

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  0:23     ` KAMEZAWA Hiroyuki
@ 2010-03-02  8:01       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon,  1 Mar 2010 22:23:40 +0100
> Andrea Righi <arighi@develer.com> wrote:
> 
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> 
> Seems nice.
> 
> Hmm. the last problem is moving account between memcg.
> 
> Right ?

Correct. This was actually the last item of the TODO list. Anyway, I'm
still considering if it's correct to move dirty pages when a task is
migrated from a cgroup to another. Currently, dirty pages just remain in
the original cgroup and are flushed depending on the original cgroup
settings. That is not totally wrong... at least moving the dirty pages
between memcgs should be optional (move_charge_at_immigrate?).

Thanks for your ack and the detailed review!

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02  8:01       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon,  1 Mar 2010 22:23:40 +0100
> Andrea Righi <arighi@develer.com> wrote:
> 
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> 
> Seems nice.
> 
> Hmm. the last problem is moving account between memcg.
> 
> Right ?

Correct. This was actually the last item of the TODO list. Anyway, I'm
still considering if it's correct to move dirty pages when a task is
migrated from a cgroup to another. Currently, dirty pages just remain in
the original cgroup and are flushed depending on the original cgroup
settings. That is not totally wrong... at least moving the dirty pages
between memcgs should be optional (move_charge_at_immigrate?).

Thanks for your ack and the detailed review!

-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  8:01       ` Andrea Righi
  (?)
@ 2010-03-02  8:12       ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02  8:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, 2 Mar 2010 09:01:58 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 
FYI, I'm planning to add file-cache and shmem/tmpfs support for move_charge feature
for 2.6.35.
But, hmm, it would be complicated if we try to move dirty account too.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  8:01       ` Andrea Righi
@ 2010-03-02  8:12         ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02  8:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Tue, 2 Mar 2010 09:01:58 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 
FYI, I'm planning to add file-cache and shmem/tmpfs support for move_charge feature
for 2.6.35.
But, hmm, it would be complicated if we try to move dirty account too.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02  8:12         ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02  8:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Tue, 2 Mar 2010 09:01:58 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 
FYI, I'm planning to add file-cache and shmem/tmpfs support for move_charge feature
for 2.6.35.
But, hmm, it would be complicated if we try to move dirty account too.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  8:01       ` Andrea Righi
                         ` (2 preceding siblings ...)
  (?)
@ 2010-03-02  8:23       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  8:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, 2 Mar 2010 09:01:58 +0100
Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:

> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 

My concern is 
 - migration between memcg is already suppoted
    - at task move
    - at rmdir

Then, if you leave DIRTY_PAGE accounting to original cgroup,
the new cgroup (migration target)'s Dirty page accounting may
goes to be negative, or incorrect value. Please check FILE_MAPPED
implementation in __mem_cgroup_move_account()

As
       if (page_mapped(page) && !PageAnon(page)) {
                /* Update mapped_file data for mem_cgroup */
                preempt_disable();
                __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                preempt_enable();
        }
then, FILE_MAPPED never goes negative.


Thanks,
-Kame

> Thanks for your ack and the detailed review!
> 
> -Andrea
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  8:01       ` Andrea Righi
@ 2010-03-02  8:23         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  8:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, 2 Mar 2010 09:01:58 +0100
Andrea Righi <arighi@develer.com> wrote:

> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 

My concern is 
 - migration between memcg is already suppoted
    - at task move
    - at rmdir

Then, if you leave DIRTY_PAGE accounting to original cgroup,
the new cgroup (migration target)'s Dirty page accounting may
goes to be negative, or incorrect value. Please check FILE_MAPPED
implementation in __mem_cgroup_move_account()

As
       if (page_mapped(page) && !PageAnon(page)) {
                /* Update mapped_file data for mem_cgroup */
                preempt_disable();
                __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                preempt_enable();
        }
then, FILE_MAPPED never goes negative.


Thanks,
-Kame

> Thanks for your ack and the detailed review!
> 
> -Andrea
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02  8:23         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  8:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, 2 Mar 2010 09:01:58 +0100
Andrea Righi <arighi@develer.com> wrote:

> On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  1 Mar 2010 22:23:40 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > the opportune kernel functions.
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > 
> > Seems nice.
> > 
> > Hmm. the last problem is moving account between memcg.
> > 
> > Right ?
> 
> Correct. This was actually the last item of the TODO list. Anyway, I'm
> still considering if it's correct to move dirty pages when a task is
> migrated from a cgroup to another. Currently, dirty pages just remain in
> the original cgroup and are flushed depending on the original cgroup
> settings. That is not totally wrong... at least moving the dirty pages
> between memcgs should be optional (move_charge_at_immigrate?).
> 

My concern is 
 - migration between memcg is already suppoted
    - at task move
    - at rmdir

Then, if you leave DIRTY_PAGE accounting to original cgroup,
the new cgroup (migration target)'s Dirty page accounting may
goes to be negative, or incorrect value. Please check FILE_MAPPED
implementation in __mem_cgroup_move_account()

As
       if (page_mapped(page) && !PageAnon(page)) {
                /* Update mapped_file data for mem_cgroup */
                preempt_disable();
                __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                preempt_enable();
        }
then, FILE_MAPPED never goes negative.


Thanks,
-Kame

> Thanks for your ack and the detailed review!
> 
> -Andrea
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267478620-5276-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-02  0:20     ` KAMEZAWA Hiroyuki
@ 2010-03-02 10:04     ` Kirill A. Shutemov
  2010-03-02 13:02     ` Balbir Singh
  2010-03-02 18:08     ` Greg Thelen
  3 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;

Why ENOMEM? Probably, EINVAL or ENOSYS?

> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       return do_swap_account ?
> +                       res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +                       nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               ret = 0;
> +               WARN_ON_ONCE(1);

I think it's a bug, not warning.

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled())
> +               return -ENOMEM;

EINVAL/ENOSYS?

> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaulate page statistics against all cgroup
> +                * under hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -ENOMEM;

ditto.

> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)

EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
it uses by filesystems.

>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
>        if (unlikely(!pc))
>                return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>        /*
>         * Preemption is already disabled. We can use __this_cpu_xxx
>         */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +       VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +       __this_cpu_add(mem->stat->count[idx], val);
>
>  done:
>        unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))

Too many unnecessary brackets

       if ((type == MEM_CGROUP_DIRTY_RATIO ||
               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)

> +               return -EINVAL;
> +
> +       spin_lock(&memcg->reclaim_param_lock);
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +               break;
> +       }
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>        return 0;
>  }
>
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +       int i;
> +
> +       for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +               dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +
> +               spin_lock(&parent->reclaim_param_lock);
> +               copy_dirty_params(mem, parent);
> +               spin_unlock(&parent->reclaim_param_lock);
> +       } else {
> +               /*
> +                * XXX: should we need a lock here? we could switch from
> +                * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +                * reading them atomically. The same for dirty_background_ratio
> +                * and dirty_background_bytes.
> +                *
> +                * For now, try to read them speculatively and retry if a
> +                * "conflict" is detected.
> +                */
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +                                               vm_dirty_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +                                               vm_dirty_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +                        mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +                                               dirty_background_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +                                               dirty_background_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting  infrastructure
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 10:04     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;

Why ENOMEM? Probably, EINVAL or ENOSYS?

> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       return do_swap_account ?
> +                       res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +                       nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               ret = 0;
> +               WARN_ON_ONCE(1);

I think it's a bug, not warning.

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled())
> +               return -ENOMEM;

EINVAL/ENOSYS?

> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaulate page statistics against all cgroup
> +                * under hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -ENOMEM;

ditto.

> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)

EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
it uses by filesystems.

>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
>        if (unlikely(!pc))
>                return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>        /*
>         * Preemption is already disabled. We can use __this_cpu_xxx
>         */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +       VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +       __this_cpu_add(mem->stat->count[idx], val);
>
>  done:
>        unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))

Too many unnecessary brackets

       if ((type == MEM_CGROUP_DIRTY_RATIO ||
               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)

> +               return -EINVAL;
> +
> +       spin_lock(&memcg->reclaim_param_lock);
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +               break;
> +       }
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>        return 0;
>  }
>
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +       int i;
> +
> +       for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +               dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +
> +               spin_lock(&parent->reclaim_param_lock);
> +               copy_dirty_params(mem, parent);
> +               spin_unlock(&parent->reclaim_param_lock);
> +       } else {
> +               /*
> +                * XXX: should we need a lock here? we could switch from
> +                * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +                * reading them atomically. The same for dirty_background_ratio
> +                * and dirty_background_bytes.
> +                *
> +                * For now, try to read them speculatively and retry if a
> +                * "conflict" is detected.
> +                */
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +                                               vm_dirty_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +                                               vm_dirty_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +                        mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +                                               dirty_background_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +                                               dirty_background_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 10:04     ` Kirill A. Shutemov
  0 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;

Why ENOMEM? Probably, EINVAL or ENOSYS?

> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       return do_swap_account ?
> +                       res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> +                       nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               ret = 0;
> +               WARN_ON_ONCE(1);

I think it's a bug, not warning.

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled())
> +               return -ENOMEM;

EINVAL/ENOSYS?

> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaulate page statistics against all cgroup
> +                * under hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -ENOMEM;

ditto.

> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)

EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
it uses by filesystems.

>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
>        if (unlikely(!pc))
>                return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>        /*
>         * Preemption is already disabled. We can use __this_cpu_xxx
>         */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +       VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +       __this_cpu_add(mem->stat->count[idx], val);
>
>  done:
>        unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))

Too many unnecessary brackets

       if ((type == MEM_CGROUP_DIRTY_RATIO ||
               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)

> +               return -EINVAL;
> +
> +       spin_lock(&memcg->reclaim_param_lock);
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +               memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +               break;
> +       }
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>        return 0;
>  }
>
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +       int i;
> +
> +       for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +               dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +
> +               spin_lock(&parent->reclaim_param_lock);
> +               copy_dirty_params(mem, parent);
> +               spin_unlock(&parent->reclaim_param_lock);
> +       } else {
> +               /*
> +                * XXX: should we need a lock here? we could switch from
> +                * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +                * reading them atomically. The same for dirty_background_ratio
> +                * and dirty_background_bytes.
> +                *
> +                * For now, try to read them speculatively and retry if a
> +                * "conflict" is detected.
> +                */
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +                                               vm_dirty_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +                                               vm_dirty_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +                        mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +               do {
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +                                               dirty_background_ratio;
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +                                               dirty_background_bytes;
> +               } while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +                       mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-01 22:02     ` Vivek Goyal
  2010-03-02  0:23     ` KAMEZAWA Hiroyuki
@ 2010-03-02 10:11     ` Kirill A. Shutemov
  2010-03-02 13:47     ` Balbir Singh
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>
>        list_del(&req->writepages_entry);
>        dec_bdi_stat(bdi, BDI_WRITEBACK);
> +       mem_cgroup_update_stat(req->pages[0],
> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>        bdi_writeout_inc(bdi);
>        wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>        req->inode = inode;
>
>        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +       mem_cgroup_update_stat(tmp_page,
> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>        end_page_writeback(page);
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>                        req->wb_index,
>                        NFS_PAGE_TAG_COMMIT);
>        spin_unlock(&inode->i_lock);
> +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>        struct page *page = req->wb_page;
>
>        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>                dec_zone_page_state(page, NR_UNSTABLE_NFS);
>                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>                return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>                req = nfs_list_entry(head->next);
>                nfs_list_remove_request(req);
>                nfs_mark_request_commit(req);
> +               mem_cgroup_update_stat(req->wb_page,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>                                BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>        kunmap_atomic(kaddr, KM_USER0);
>
> -       if (!TestSetPageWriteback(clone_page))
> +       if (!TestSetPageWriteback(clone_page)) {
> +               mem_cgroup_update_stat(clone_page,

s/clone_page/page/

And #include <linux/memcontrol.h> is missed.

> +                               MEM_CGROUP_STAT_WRITEBACK, 1);
>                inc_zone_page_state(clone_page, NR_WRITEBACK);
> +       }
>        unlock_page(clone_page);
>
>        return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>        }
>
>        if (buffer_nilfs_allocated(page_buffers(page))) {
> -               if (TestClearPageWriteback(page))
> +               if (TestClearPageWriteback(page)) {
> +                       mem_cgroup_update_stat(clone_page,
> +                                       MEM_CGROUP_STAT_WRITEBACK, -1);
>                        dec_zone_page_state(page, NR_WRITEBACK);
> +               }
>        } else
>                end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>         * having removed the page entirely.
>         */
>        if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                dec_zone_page_state(page, NR_FILE_DIRTY);
>                dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>        }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>  */
>  static int calc_period_shift(void)
>  {
> -       unsigned long dirty_total;
> +       unsigned long dirty_total, dirty_bytes;
>
> -       if (vm_dirty_bytes)
> -               dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +       dirty_bytes = mem_cgroup_dirty_bytes();
> +       if (dirty_bytes)
> +               dirty_total = dirty_bytes / PAGE_SIZE;
>        else
> -               dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -                               100;
> +               dirty_total = (mem_cgroup_dirty_ratio() *
> +                               determine_dirtyable_memory()) / 100;
>        return 2 + ilog2(dirty_total - 1);
>  }
>
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -       unsigned long x;
> -
> -       x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +       unsigned long memory;
> +       s64 memcg_memory;
>
> +       memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>        if (!vm_highmem_is_dirtyable)
> -               x -= highmem_dirtyable_memory(x);
> -
> -       return x + 1;   /* Ensure that we never return 0 */
> +               memory -= highmem_dirtyable_memory(memory);
> +       memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +       if (memcg_memory < 0)
> +               return memory + 1;
> +       return min((unsigned long)memcg_memory, memory + 1);
>  }
>
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>                 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>        unsigned long background;
> -       unsigned long dirty;
> +       unsigned long dirty, dirty_bytes, dirty_background;
>        unsigned long available_memory = determine_dirtyable_memory();
>        struct task_struct *tsk;
>
> -       if (vm_dirty_bytes)
> -               dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +       dirty_bytes = mem_cgroup_dirty_bytes();
> +       if (dirty_bytes)
> +               dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>        else {
>                int dirty_ratio;
>
> -               dirty_ratio = vm_dirty_ratio;
> +               dirty_ratio = mem_cgroup_dirty_ratio();
>                if (dirty_ratio < 5)
>                        dirty_ratio = 5;
>                dirty = (dirty_ratio * available_memory) / 100;
>        }
>
> -       if (dirty_background_bytes)
> -               background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +       dirty_background = mem_cgroup_dirty_background_bytes();
> +       if (dirty_background)
> +               background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>        else
> -               background = (dirty_background_ratio * available_memory) / 100;
> -
> +               background = (mem_cgroup_dirty_background_ratio() *
> +                                       available_memory) / 100;
>        if (background >= dirty)
>                background = dirty / 2;
>        tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>                get_dirty_limits(&background_thresh, &dirty_thresh,
>                                &bdi_thresh, bdi);
>
> -               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +               nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +               nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +               if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +                       nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>                                        global_page_state(NR_UNSTABLE_NFS);
> -               nr_writeback = global_page_state(NR_WRITEBACK);
> +                       nr_writeback = global_page_state(NR_WRITEBACK);
> +               }
>
>                bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>                if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>         * In normal mode, we start background writeout at the lower
>         * background_thresh, to keep the amount of dirty memory low.
>         */
> +       nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +       if (nr_reclaimable < 0)
> +               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +                               global_page_state(NR_UNSTABLE_NFS);
>        if ((laptop_mode && pages_written) ||
> -           (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -                              + global_page_state(NR_UNSTABLE_NFS))
> -                                         > background_thresh)))
> +           (!laptop_mode && (nr_reclaimable > background_thresh)))
>                bdi_start_writeback(bdi, NULL, 0);
>  }
>
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>        unsigned long dirty_thresh;
>
>         for ( ; ; ) {
> +               unsigned long dirty;
> +
>                get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>
>                 /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                  */
>                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -                       global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                               break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +               dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +               if (dirty < 0)
> +                       dirty = global_page_state(NR_UNSTABLE_NFS) +
> +                               global_page_state(NR_WRITEBACK);
> +               if (dirty <= dirty_thresh)
> +                       break;
> +               congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                /*
>                 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>        if (mapping_cap_account_dirty(mapping)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>                __inc_zone_page_state(page, NR_FILE_DIRTY);
>                __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>                task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>                 * for more comments.
>                 */
>                if (TestClearPageDirty(page)) {
> +                       mem_cgroup_update_stat(page,
> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                        dec_bdi_stat(mapping->backing_dev_info,
>                                        BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>        } else {
>                ret = TestClearPageWriteback(page);
>        }
> -       if (ret)
> +       if (ret) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>                dec_zone_page_state(page, NR_WRITEBACK);
> +       }
>        return ret;
>  }
>
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>        } else {
>                ret = TestSetPageWriteback(page);
>        }
> -       if (!ret)
> +       if (!ret) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>                inc_zone_page_state(page, NR_WRITEBACK);
> +       }
>        return ret;
>
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>        if (atomic_inc_and_test(&page->_mapcount)) {
>                __inc_zone_page_state(page, NR_FILE_MAPPED);
> -               mem_cgroup_update_file_mapped(page, 1);
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>        }
>  }
>
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>                __dec_zone_page_state(page, NR_ANON_PAGES);
>        } else {
>                __dec_zone_page_state(page, NR_FILE_MAPPED);
> -               mem_cgroup_update_file_mapped(page, -1);
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>        }
>        /*
>         * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>        if (TestClearPageDirty(page)) {
>                struct address_space *mapping = page->mapping;
>                if (mapping && mapping_cap_account_dirty(mapping)) {
> +                       mem_cgroup_update_stat(page,
> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                        dec_bdi_stat(mapping->backing_dev_info,
>                                        BDI_DIRTY);
> --
> 1.6.3.3
>
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 10:11     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 16112 bytes --]

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:> Apply the cgroup dirty pages accounting and limiting infrastructure to> the opportune kernel functions.>> Signed-off-by: Andrea Righi <arighi@develer.com>> --->  fs/fuse/file.c      |    5 +++>  fs/nfs/write.c      |    4 ++>  fs/nilfs2/segment.c |   10 +++++->  mm/filemap.c        |    1 +>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------>  mm/rmap.c           |    4 +->  mm/truncate.c       |    2 +>  7 files changed, 76 insertions(+), 34 deletions(-)>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c> index a9f5e13..dbbdd53 100644> --- a/fs/fuse/file.c> +++ b/fs/fuse/file.c> @@ -11,6 +11,7 @@>  #include <linux/pagemap.h>>  #include <linux/slab.h>>  #include <linux/kernel.h>> +#include <linux/memcontrol.h>>  #include <linux/sched.h>>  #include <linux/module.h>>> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)>>        list_del(&req->writepages_entry);>        dec_bdi_stat(bdi, BDI_WRITEBACK);> +       mem_cgroup_update_stat(req->pages[0],> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);>        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);>        bdi_writeout_inc(bdi);>        wake_up(&fi->page_waitq);> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)>        req->inode = inode;>>        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);> +       mem_cgroup_update_stat(tmp_page,> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);>        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);>        end_page_writeback(page);>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c> index b753242..7316f7a 100644> --- a/fs/nfs/write.c> +++ b/fs/nfs/write.c> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)>                        req->wb_index,>                        NFS_PAGE_TAG_COMMIT);>        spin_unlock(&inode->i_lock);> +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);>        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);>        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);>        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)>        struct page *page = req->wb_page;>>        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);>                dec_zone_page_state(page, NR_UNSTABLE_NFS);>                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);>                return 1;> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)>                req = nfs_list_entry(head->next);>                nfs_list_remove_request(req);>                nfs_mark_request_commit(req);> +               mem_cgroup_update_stat(req->wb_page,> +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);>                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);>                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,>                                BDI_UNSTABLE);> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c> index ada2f1b..aef6d13 100644> --- a/fs/nilfs2/segment.c> +++ b/fs/nilfs2/segment.c> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)>        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);>        kunmap_atomic(kaddr, KM_USER0);>> -       if (!TestSetPageWriteback(clone_page))> +       if (!TestSetPageWriteback(clone_page)) {> +               mem_cgroup_update_stat(clone_page,
s/clone_page/page/
And #include <linux/memcontrol.h> is missed.
> +                               MEM_CGROUP_STAT_WRITEBACK, 1);>                inc_zone_page_state(clone_page, NR_WRITEBACK);> +       }>        unlock_page(clone_page);>>        return 0;> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)>        }>>        if (buffer_nilfs_allocated(page_buffers(page))) {> -               if (TestClearPageWriteback(page))> +               if (TestClearPageWriteback(page)) {> +                       mem_cgroup_update_stat(clone_page,> +                                       MEM_CGROUP_STAT_WRITEBACK, -1);>                        dec_zone_page_state(page, NR_WRITEBACK);> +               }>        } else>                end_page_writeback(page);>  }> diff --git a/mm/filemap.c b/mm/filemap.c> index fe09e51..f85acae 100644> --- a/mm/filemap.c> +++ b/mm/filemap.c> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)>         * having removed the page entirely.>         */>        if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);>                dec_zone_page_state(page, NR_FILE_DIRTY);>                dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);>        }> diff --git a/mm/page-writeback.c b/mm/page-writeback.c> index 5a0f8f3..d83f41c 100644> --- a/mm/page-writeback.c> +++ b/mm/page-writeback.c> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;>  */>  static int calc_period_shift(void)>  {> -       unsigned long dirty_total;> +       unsigned long dirty_total, dirty_bytes;>> -       if (vm_dirty_bytes)> -               dirty_total = vm_dirty_bytes / PAGE_SIZE;> +       dirty_bytes = mem_cgroup_dirty_bytes();> +       if (dirty_bytes)> +               dirty_total = dirty_bytes / PAGE_SIZE;>        else> -               dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /> -                               100;> +               dirty_total = (mem_cgroup_dirty_ratio() *> +                               determine_dirtyable_memory()) / 100;>        return 2 + ilog2(dirty_total - 1);>  }>> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)>  */>  unsigned long determine_dirtyable_memory(void)>  {> -       unsigned long x;> -> -       x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();> +       unsigned long memory;> +       s64 memcg_memory;>> +       memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();>        if (!vm_highmem_is_dirtyable)> -               x -= highmem_dirtyable_memory(x);> -> -       return x + 1;   /* Ensure that we never return 0 */> +               memory -= highmem_dirtyable_memory(memory);> +       memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);> +       if (memcg_memory < 0)> +               return memory + 1;> +       return min((unsigned long)memcg_memory, memory + 1);>  }>>  void> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,>                 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)>  {>        unsigned long background;> -       unsigned long dirty;> +       unsigned long dirty, dirty_bytes, dirty_background;>        unsigned long available_memory = determine_dirtyable_memory();>        struct task_struct *tsk;>> -       if (vm_dirty_bytes)> -               dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);> +       dirty_bytes = mem_cgroup_dirty_bytes();> +       if (dirty_bytes)> +               dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);>        else {>                int dirty_ratio;>> -               dirty_ratio = vm_dirty_ratio;> +               dirty_ratio = mem_cgroup_dirty_ratio();>                if (dirty_ratio < 5)>                        dirty_ratio = 5;>                dirty = (dirty_ratio * available_memory) / 100;>        }>> -       if (dirty_background_bytes)> -               background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);> +       dirty_background = mem_cgroup_dirty_background_bytes();> +       if (dirty_background)> +               background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);>        else> -               background = (dirty_background_ratio * available_memory) / 100;> -> +               background = (mem_cgroup_dirty_background_ratio() *> +                                       available_memory) / 100;>        if (background >= dirty)>                background = dirty / 2;>        tsk = current;> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,>                get_dirty_limits(&background_thresh, &dirty_thresh,>                                &bdi_thresh, bdi);>> -               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +> +               nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);> +               nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);> +               if ((nr_reclaimable < 0) || (nr_writeback < 0)) {> +                       nr_reclaimable = global_page_state(NR_FILE_DIRTY) +>                                        global_page_state(NR_UNSTABLE_NFS);> -               nr_writeback = global_page_state(NR_WRITEBACK);> +                       nr_writeback = global_page_state(NR_WRITEBACK);> +               }>>                bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);>                if (bdi_cap_account_unstable(bdi)) {> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,>         * In normal mode, we start background writeout at the lower>         * background_thresh, to keep the amount of dirty memory low.>         */> +       nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);> +       if (nr_reclaimable < 0)> +               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +> +                               global_page_state(NR_UNSTABLE_NFS);>        if ((laptop_mode && pages_written) ||> -           (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)> -                              + global_page_state(NR_UNSTABLE_NFS))> -                                         > background_thresh)))> +           (!laptop_mode && (nr_reclaimable > background_thresh)))>                bdi_start_writeback(bdi, NULL, 0);>  }>> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)>        unsigned long dirty_thresh;>>         for ( ; ; ) {> +               unsigned long dirty;> +>                get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);>>                 /*> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)>                  */>                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */>> -                if (global_page_state(NR_UNSTABLE_NFS) +> -                       global_page_state(NR_WRITEBACK) <= dirty_thresh)> -                               break;> -                congestion_wait(BLK_RW_ASYNC, HZ/10);> +> +               dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);> +               if (dirty < 0)> +                       dirty = global_page_state(NR_UNSTABLE_NFS) +> +                               global_page_state(NR_WRITEBACK);> +               if (dirty <= dirty_thresh)> +                       break;> +               congestion_wait(BLK_RW_ASYNC, HZ/10);>>                /*>                 * The caller might hold locks which can prevent IO completion> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)>  void account_page_dirtied(struct page *page, struct address_space *mapping)>  {>        if (mapping_cap_account_dirty(mapping)) {> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);>                __inc_zone_page_state(page, NR_FILE_DIRTY);>                __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);>                task_dirty_inc(current);> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)>                 * for more comments.>                 */>                if (TestClearPageDirty(page)) {> +                       mem_cgroup_update_stat(page,> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);>                        dec_zone_page_state(page, NR_FILE_DIRTY);>                        dec_bdi_stat(mapping->backing_dev_info,>                                        BDI_DIRTY);> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)>        } else {>                ret = TestClearPageWriteback(page);>        }> -       if (ret)> +       if (ret) {> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);>                dec_zone_page_state(page, NR_WRITEBACK);> +       }>        return ret;>  }>> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)>        } else {>                ret = TestSetPageWriteback(page);>        }> -       if (!ret)> +       if (!ret) {> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);>                inc_zone_page_state(page, NR_WRITEBACK);> +       }>        return ret;>>  }> diff --git a/mm/rmap.c b/mm/rmap.c> index 4d2fb93..8d74335 100644> --- a/mm/rmap.c> +++ b/mm/rmap.c> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)>  {>        if (atomic_inc_and_test(&page->_mapcount)) {>                __inc_zone_page_state(page, NR_FILE_MAPPED);> -               mem_cgroup_update_file_mapped(page, 1);> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);>        }>  }>> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)>                __dec_zone_page_state(page, NR_ANON_PAGES);>        } else {>                __dec_zone_page_state(page, NR_FILE_MAPPED);> -               mem_cgroup_update_file_mapped(page, -1);> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);>        }>        /*>         * It would be tidy to reset the PageAnon mapping here,> diff --git a/mm/truncate.c b/mm/truncate.c> index 2466e0c..5f437e7 100644> --- a/mm/truncate.c> +++ b/mm/truncate.c> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)>        if (TestClearPageDirty(page)) {>                struct address_space *mapping = page->mapping;>                if (mapping && mapping_cap_account_dirty(mapping)) {> +                       mem_cgroup_update_stat(page,> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);>                        dec_zone_page_state(page, NR_FILE_DIRTY);>                        dec_bdi_stat(mapping->backing_dev_info,>                                        BDI_DIRTY);> --> 1.6.3.3>>ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 10:11     ` Kirill A. Shutemov
  0 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 10:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>
>        list_del(&req->writepages_entry);
>        dec_bdi_stat(bdi, BDI_WRITEBACK);
> +       mem_cgroup_update_stat(req->pages[0],
> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>        bdi_writeout_inc(bdi);
>        wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>        req->inode = inode;
>
>        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +       mem_cgroup_update_stat(tmp_page,
> +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>        end_page_writeback(page);
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>                        req->wb_index,
>                        NFS_PAGE_TAG_COMMIT);
>        spin_unlock(&inode->i_lock);
> +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>        struct page *page = req->wb_page;
>
>        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>                dec_zone_page_state(page, NR_UNSTABLE_NFS);
>                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>                return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>                req = nfs_list_entry(head->next);
>                nfs_list_remove_request(req);
>                nfs_mark_request_commit(req);
> +               mem_cgroup_update_stat(req->wb_page,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>                                BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>        kunmap_atomic(kaddr, KM_USER0);
>
> -       if (!TestSetPageWriteback(clone_page))
> +       if (!TestSetPageWriteback(clone_page)) {
> +               mem_cgroup_update_stat(clone_page,

s/clone_page/page/

And #include <linux/memcontrol.h> is missed.

> +                               MEM_CGROUP_STAT_WRITEBACK, 1);
>                inc_zone_page_state(clone_page, NR_WRITEBACK);
> +       }
>        unlock_page(clone_page);
>
>        return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>        }
>
>        if (buffer_nilfs_allocated(page_buffers(page))) {
> -               if (TestClearPageWriteback(page))
> +               if (TestClearPageWriteback(page)) {
> +                       mem_cgroup_update_stat(clone_page,
> +                                       MEM_CGROUP_STAT_WRITEBACK, -1);
>                        dec_zone_page_state(page, NR_WRITEBACK);
> +               }
>        } else
>                end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>         * having removed the page entirely.
>         */
>        if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                dec_zone_page_state(page, NR_FILE_DIRTY);
>                dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>        }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>  */
>  static int calc_period_shift(void)
>  {
> -       unsigned long dirty_total;
> +       unsigned long dirty_total, dirty_bytes;
>
> -       if (vm_dirty_bytes)
> -               dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +       dirty_bytes = mem_cgroup_dirty_bytes();
> +       if (dirty_bytes)
> +               dirty_total = dirty_bytes / PAGE_SIZE;
>        else
> -               dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -                               100;
> +               dirty_total = (mem_cgroup_dirty_ratio() *
> +                               determine_dirtyable_memory()) / 100;
>        return 2 + ilog2(dirty_total - 1);
>  }
>
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -       unsigned long x;
> -
> -       x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +       unsigned long memory;
> +       s64 memcg_memory;
>
> +       memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>        if (!vm_highmem_is_dirtyable)
> -               x -= highmem_dirtyable_memory(x);
> -
> -       return x + 1;   /* Ensure that we never return 0 */
> +               memory -= highmem_dirtyable_memory(memory);
> +       memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +       if (memcg_memory < 0)
> +               return memory + 1;
> +       return min((unsigned long)memcg_memory, memory + 1);
>  }
>
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>                 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>        unsigned long background;
> -       unsigned long dirty;
> +       unsigned long dirty, dirty_bytes, dirty_background;
>        unsigned long available_memory = determine_dirtyable_memory();
>        struct task_struct *tsk;
>
> -       if (vm_dirty_bytes)
> -               dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +       dirty_bytes = mem_cgroup_dirty_bytes();
> +       if (dirty_bytes)
> +               dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>        else {
>                int dirty_ratio;
>
> -               dirty_ratio = vm_dirty_ratio;
> +               dirty_ratio = mem_cgroup_dirty_ratio();
>                if (dirty_ratio < 5)
>                        dirty_ratio = 5;
>                dirty = (dirty_ratio * available_memory) / 100;
>        }
>
> -       if (dirty_background_bytes)
> -               background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +       dirty_background = mem_cgroup_dirty_background_bytes();
> +       if (dirty_background)
> +               background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>        else
> -               background = (dirty_background_ratio * available_memory) / 100;
> -
> +               background = (mem_cgroup_dirty_background_ratio() *
> +                                       available_memory) / 100;
>        if (background >= dirty)
>                background = dirty / 2;
>        tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>                get_dirty_limits(&background_thresh, &dirty_thresh,
>                                &bdi_thresh, bdi);
>
> -               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +               nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +               nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +               if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +                       nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>                                        global_page_state(NR_UNSTABLE_NFS);
> -               nr_writeback = global_page_state(NR_WRITEBACK);
> +                       nr_writeback = global_page_state(NR_WRITEBACK);
> +               }
>
>                bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>                if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>         * In normal mode, we start background writeout at the lower
>         * background_thresh, to keep the amount of dirty memory low.
>         */
> +       nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +       if (nr_reclaimable < 0)
> +               nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +                               global_page_state(NR_UNSTABLE_NFS);
>        if ((laptop_mode && pages_written) ||
> -           (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -                              + global_page_state(NR_UNSTABLE_NFS))
> -                                         > background_thresh)))
> +           (!laptop_mode && (nr_reclaimable > background_thresh)))
>                bdi_start_writeback(bdi, NULL, 0);
>  }
>
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>        unsigned long dirty_thresh;
>
>         for ( ; ; ) {
> +               unsigned long dirty;
> +
>                get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>
>                 /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                  */
>                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -                       global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                               break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +               dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +               if (dirty < 0)
> +                       dirty = global_page_state(NR_UNSTABLE_NFS) +
> +                               global_page_state(NR_WRITEBACK);
> +               if (dirty <= dirty_thresh)
> +                       break;
> +               congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                /*
>                 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>        if (mapping_cap_account_dirty(mapping)) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>                __inc_zone_page_state(page, NR_FILE_DIRTY);
>                __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>                task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>                 * for more comments.
>                 */
>                if (TestClearPageDirty(page)) {
> +                       mem_cgroup_update_stat(page,
> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                        dec_bdi_stat(mapping->backing_dev_info,
>                                        BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>        } else {
>                ret = TestClearPageWriteback(page);
>        }
> -       if (ret)
> +       if (ret) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>                dec_zone_page_state(page, NR_WRITEBACK);
> +       }
>        return ret;
>  }
>
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>        } else {
>                ret = TestSetPageWriteback(page);
>        }
> -       if (!ret)
> +       if (!ret) {
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>                inc_zone_page_state(page, NR_WRITEBACK);
> +       }
>        return ret;
>
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>        if (atomic_inc_and_test(&page->_mapcount)) {
>                __inc_zone_page_state(page, NR_FILE_MAPPED);
> -               mem_cgroup_update_file_mapped(page, 1);
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>        }
>  }
>
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>                __dec_zone_page_state(page, NR_ANON_PAGES);
>        } else {
>                __dec_zone_page_state(page, NR_FILE_MAPPED);
> -               mem_cgroup_update_file_mapped(page, -1);
> +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>        }
>        /*
>         * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>        if (TestClearPageDirty(page)) {
>                struct address_space *mapping = page->mapping;
>                if (mapping && mapping_cap_account_dirty(mapping)) {
> +                       mem_cgroup_update_stat(page,
> +                                       MEM_CGROUP_STAT_FILE_DIRTY, -1);
>                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                        dec_bdi_stat(mapping->backing_dev_info,
>                                        BDI_DIRTY);
> --
> 1.6.3.3
>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]     ` <cc557aab1003020204k16038838ta537357aeeb67b11-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-02 11:00       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 12:04:53PM +0200, Kirill A. Shutemov wrote:
[snip]
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> 
> Why ENOMEM? Probably, EINVAL or ENOSYS?

OK, ENOSYS is more appropriate IMHO.

> > +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> > +                               enum mem_cgroup_page_stat_item item)
> > +{
> > +       s64 ret;
> > +
> > +       switch (item) {
> > +       case MEMCG_NR_DIRTYABLE_PAGES:
> > +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> > +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> > +               /* Translate free memory in pages */
> > +               ret >>= PAGE_SHIFT;
> > +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> > +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> > +               if (mem_cgroup_can_swap(memcg))
> > +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> > +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> > +               break;
> > +       case MEMCG_NR_RECLAIM_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       case MEMCG_NR_WRITEBACK:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> > +               break;
> > +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       default:
> > +               ret = 0;
> > +               WARN_ON_ONCE(1);
> 
> I think it's a bug, not warning.

OK.

> > +       }
> > +       return ret;
> > +}
> > +
> > +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> > +{
> > +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> > +
> > +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> > +       return 0;
> > +}
> > +
> > +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       struct mem_cgroup_page_stat stat = {};
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return -ENOMEM;
> 
> EINVAL/ENOSYS?

OK.

> 
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (memcg) {
> > +               /*
> > +                * Recursively evaulate page statistics against all cgroup
> > +                * under hierarchy tree
> > +                */
> > +               stat.item = item;
> > +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> > +       } else
> > +               stat.value = -ENOMEM;
> 
> ditto.

OK.

> 
> > +       rcu_read_unlock();
> > +
> > +       return stat.value;
> > +}
> > +
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >        int *val = data;
> > @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update memory cgroup statistics.
> >  */
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> 
> EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
> it uses by filesystems.

Agreed.

> > +static int
> > +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +       int type = cft->private;
> > +
> > +       if (cgrp->parent == NULL)
> > +               return -EINVAL;
> > +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> > +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> 
> Too many unnecessary brackets
> 
>        if ((type == MEM_CGROUP_DIRTY_RATIO ||
>                type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> 

OK.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-02 10:04     ` Kirill A. Shutemov
@ 2010-03-02 11:00       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 12:04:53PM +0200, Kirill A. Shutemov wrote:
[snip]
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> 
> Why ENOMEM? Probably, EINVAL or ENOSYS?

OK, ENOSYS is more appropriate IMHO.

> > +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> > +                               enum mem_cgroup_page_stat_item item)
> > +{
> > +       s64 ret;
> > +
> > +       switch (item) {
> > +       case MEMCG_NR_DIRTYABLE_PAGES:
> > +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> > +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> > +               /* Translate free memory in pages */
> > +               ret >>= PAGE_SHIFT;
> > +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> > +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> > +               if (mem_cgroup_can_swap(memcg))
> > +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> > +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> > +               break;
> > +       case MEMCG_NR_RECLAIM_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       case MEMCG_NR_WRITEBACK:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> > +               break;
> > +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       default:
> > +               ret = 0;
> > +               WARN_ON_ONCE(1);
> 
> I think it's a bug, not warning.

OK.

> > +       }
> > +       return ret;
> > +}
> > +
> > +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> > +{
> > +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> > +
> > +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> > +       return 0;
> > +}
> > +
> > +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       struct mem_cgroup_page_stat stat = {};
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return -ENOMEM;
> 
> EINVAL/ENOSYS?

OK.

> 
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (memcg) {
> > +               /*
> > +                * Recursively evaulate page statistics against all cgroup
> > +                * under hierarchy tree
> > +                */
> > +               stat.item = item;
> > +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> > +       } else
> > +               stat.value = -ENOMEM;
> 
> ditto.

OK.

> 
> > +       rcu_read_unlock();
> > +
> > +       return stat.value;
> > +}
> > +
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >        int *val = data;
> > @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update memory cgroup statistics.
> >  */
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> 
> EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
> it uses by filesystems.

Agreed.

> > +static int
> > +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +       int type = cft->private;
> > +
> > +       if (cgrp->parent == NULL)
> > +               return -EINVAL;
> > +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> > +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> 
> Too many unnecessary brackets
> 
>        if ((type == MEM_CGROUP_DIRTY_RATIO ||
>                type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> 

OK.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 11:00       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 12:04:53PM +0200, Kirill A. Shutemov wrote:
[snip]
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> 
> Why ENOMEM? Probably, EINVAL or ENOSYS?

OK, ENOSYS is more appropriate IMHO.

> > +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> > +                               enum mem_cgroup_page_stat_item item)
> > +{
> > +       s64 ret;
> > +
> > +       switch (item) {
> > +       case MEMCG_NR_DIRTYABLE_PAGES:
> > +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> > +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> > +               /* Translate free memory in pages */
> > +               ret >>= PAGE_SHIFT;
> > +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> > +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> > +               if (mem_cgroup_can_swap(memcg))
> > +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> > +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> > +               break;
> > +       case MEMCG_NR_RECLAIM_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       case MEMCG_NR_WRITEBACK:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> > +               break;
> > +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> > +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> > +                       mem_cgroup_read_stat(memcg,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +               break;
> > +       default:
> > +               ret = 0;
> > +               WARN_ON_ONCE(1);
> 
> I think it's a bug, not warning.

OK.

> > +       }
> > +       return ret;
> > +}
> > +
> > +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> > +{
> > +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> > +
> > +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> > +       return 0;
> > +}
> > +
> > +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       struct mem_cgroup_page_stat stat = {};
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return -ENOMEM;
> 
> EINVAL/ENOSYS?

OK.

> 
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (memcg) {
> > +               /*
> > +                * Recursively evaulate page statistics against all cgroup
> > +                * under hierarchy tree
> > +                */
> > +               stat.item = item;
> > +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> > +       } else
> > +               stat.value = -ENOMEM;
> 
> ditto.

OK.

> 
> > +       rcu_read_unlock();
> > +
> > +       return stat.value;
> > +}
> > +
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >        int *val = data;
> > @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update memory cgroup statistics.
> >  */
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> 
> EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
> it uses by filesystems.

Agreed.

> > +static int
> > +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > +       int type = cft->private;
> > +
> > +       if (cgrp->parent == NULL)
> > +               return -EINVAL;
> > +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> > +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> 
> Too many unnecessary brackets
> 
>        if ((type == MEM_CGROUP_DIRTY_RATIO ||
>                type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> 

OK.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]     ` <cc557aab1003020211h391947f0p3eae04a298127d32-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-02 11:02       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> >
> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > ---
> >  fs/fuse/file.c      |    5 +++
> >  fs/nfs/write.c      |    4 ++
> >  fs/nilfs2/segment.c |   10 +++++-
> >  mm/filemap.c        |    1 +
> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >  mm/rmap.c           |    4 +-
> >  mm/truncate.c       |    2 +
> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index a9f5e13..dbbdd53 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/slab.h>
> >  #include <linux/kernel.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/sched.h>
> >  #include <linux/module.h>
> >
> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >
> >        list_del(&req->writepages_entry);
> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(req->pages[0],
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >        bdi_writeout_inc(bdi);
> >        wake_up(&fi->page_waitq);
> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >        req->inode = inode;
> >
> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(tmp_page,
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >        end_page_writeback(page);
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index b753242..7316f7a 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >                        req->wb_index,
> >                        NFS_PAGE_TAG_COMMIT);
> >        spin_unlock(&inode->i_lock);
> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >        struct page *page = req->wb_page;
> >
> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >                return 1;
> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >                req = nfs_list_entry(head->next);
> >                nfs_list_remove_request(req);
> >                nfs_mark_request_commit(req);
> > +               mem_cgroup_update_stat(req->wb_page,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >                                BDI_UNSTABLE);
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index ada2f1b..aef6d13 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >        kunmap_atomic(kaddr, KM_USER0);
> >
> > -       if (!TestSetPageWriteback(clone_page))
> > +       if (!TestSetPageWriteback(clone_page)) {
> > +               mem_cgroup_update_stat(clone_page,
> 
> s/clone_page/page/

mmh... shouldn't we use the same page used by TestSetPageWriteback() and
inc_zone_page_state()?

> 
> And #include <linux/memcontrol.h> is missed.

OK.

I'll apply your fixes and post a new version.

Thanks for reviewing,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 10:11     ` Kirill A. Shutemov
@ 2010-03-02 11:02       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> >
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  fs/fuse/file.c      |    5 +++
> >  fs/nfs/write.c      |    4 ++
> >  fs/nilfs2/segment.c |   10 +++++-
> >  mm/filemap.c        |    1 +
> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >  mm/rmap.c           |    4 +-
> >  mm/truncate.c       |    2 +
> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index a9f5e13..dbbdd53 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/slab.h>
> >  #include <linux/kernel.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/sched.h>
> >  #include <linux/module.h>
> >
> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >
> >        list_del(&req->writepages_entry);
> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(req->pages[0],
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >        bdi_writeout_inc(bdi);
> >        wake_up(&fi->page_waitq);
> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >        req->inode = inode;
> >
> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(tmp_page,
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >        end_page_writeback(page);
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index b753242..7316f7a 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >                        req->wb_index,
> >                        NFS_PAGE_TAG_COMMIT);
> >        spin_unlock(&inode->i_lock);
> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >        struct page *page = req->wb_page;
> >
> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >                return 1;
> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >                req = nfs_list_entry(head->next);
> >                nfs_list_remove_request(req);
> >                nfs_mark_request_commit(req);
> > +               mem_cgroup_update_stat(req->wb_page,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >                                BDI_UNSTABLE);
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index ada2f1b..aef6d13 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >        kunmap_atomic(kaddr, KM_USER0);
> >
> > -       if (!TestSetPageWriteback(clone_page))
> > +       if (!TestSetPageWriteback(clone_page)) {
> > +               mem_cgroup_update_stat(clone_page,
> 
> s/clone_page/page/

mmh... shouldn't we use the same page used by TestSetPageWriteback() and
inc_zone_page_state()?

> 
> And #include <linux/memcontrol.h> is missed.

OK.

I'll apply your fixes and post a new version.

Thanks for reviewing,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 11:02       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> >
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  fs/fuse/file.c      |    5 +++
> >  fs/nfs/write.c      |    4 ++
> >  fs/nilfs2/segment.c |   10 +++++-
> >  mm/filemap.c        |    1 +
> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >  mm/rmap.c           |    4 +-
> >  mm/truncate.c       |    2 +
> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index a9f5e13..dbbdd53 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/slab.h>
> >  #include <linux/kernel.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/sched.h>
> >  #include <linux/module.h>
> >
> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >
> >        list_del(&req->writepages_entry);
> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(req->pages[0],
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >        bdi_writeout_inc(bdi);
> >        wake_up(&fi->page_waitq);
> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >        req->inode = inode;
> >
> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +       mem_cgroup_update_stat(tmp_page,
> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >        end_page_writeback(page);
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index b753242..7316f7a 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >                        req->wb_index,
> >                        NFS_PAGE_TAG_COMMIT);
> >        spin_unlock(&inode->i_lock);
> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >        struct page *page = req->wb_page;
> >
> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >                return 1;
> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >                req = nfs_list_entry(head->next);
> >                nfs_list_remove_request(req);
> >                nfs_mark_request_commit(req);
> > +               mem_cgroup_update_stat(req->wb_page,
> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >                                BDI_UNSTABLE);
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index ada2f1b..aef6d13 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >        kunmap_atomic(kaddr, KM_USER0);
> >
> > -       if (!TestSetPageWriteback(clone_page))
> > +       if (!TestSetPageWriteback(clone_page)) {
> > +               mem_cgroup_update_stat(clone_page,
> 
> s/clone_page/page/

mmh... shouldn't we use the same page used by TestSetPageWriteback() and
inc_zone_page_state()?

> 
> And #include <linux/memcontrol.h> is missed.

OK.

I'll apply your fixes and post a new version.

Thanks for reviewing,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 11:02       ` Andrea Righi
  (?)
@ 2010-03-02 11:09       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 11:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
>> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
>> > Apply the cgroup dirty pages accounting and limiting infrastructure to
>> > the opportune kernel functions.
>> >
>> > Signed-off-by: Andrea Righi <arighi@develer.com>
>> > ---
>> >  fs/fuse/file.c      |    5 +++
>> >  fs/nfs/write.c      |    4 ++
>> >  fs/nilfs2/segment.c |   10 +++++-
>> >  mm/filemap.c        |    1 +
>> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>> >  mm/rmap.c           |    4 +-
>> >  mm/truncate.c       |    2 +
>> >  7 files changed, 76 insertions(+), 34 deletions(-)
>> >
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index a9f5e13..dbbdd53 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -11,6 +11,7 @@
>> >  #include <linux/pagemap.h>
>> >  #include <linux/slab.h>
>> >  #include <linux/kernel.h>
>> > +#include <linux/memcontrol.h>
>> >  #include <linux/sched.h>
>> >  #include <linux/module.h>
>> >
>> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>> >
>> >        list_del(&req->writepages_entry);
>> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(req->pages[0],
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>> >        bdi_writeout_inc(bdi);
>> >        wake_up(&fi->page_waitq);
>> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>> >        req->inode = inode;
>> >
>> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(tmp_page,
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>> >        end_page_writeback(page);
>> >
>> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> > index b753242..7316f7a 100644
>> > --- a/fs/nfs/write.c
>> > +++ b/fs/nfs/write.c
>> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>> >                        req->wb_index,
>> >                        NFS_PAGE_TAG_COMMIT);
>> >        spin_unlock(&inode->i_lock);
>> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>> >        struct page *page = req->wb_page;
>> >
>> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
>> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >                return 1;
>> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>> >                req = nfs_list_entry(head->next);
>> >                nfs_list_remove_request(req);
>> >                nfs_mark_request_commit(req);
>> > +               mem_cgroup_update_stat(req->wb_page,
>> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>> >                                BDI_UNSTABLE);
>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> > index ada2f1b..aef6d13 100644
>> > --- a/fs/nilfs2/segment.c
>> > +++ b/fs/nilfs2/segment.c
>> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>> >        kunmap_atomic(kaddr, KM_USER0);
>> >
>> > -       if (!TestSetPageWriteback(clone_page))
>> > +       if (!TestSetPageWriteback(clone_page)) {
>> > +               mem_cgroup_update_stat(clone_page,
>>
>> s/clone_page/page/
>
> mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> inc_zone_page_state()?

Sorry, I've commented wrong hunk. It's for the next one.

>>
>> And #include <linux/memcontrol.h> is missed.
>
> OK.
>
> I'll apply your fixes and post a new version.
>
> Thanks for reviewing,
> -Andrea
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 11:02       ` Andrea Righi
@ 2010-03-02 11:09         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 11:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
>> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
>> > Apply the cgroup dirty pages accounting and limiting infrastructure to
>> > the opportune kernel functions.
>> >
>> > Signed-off-by: Andrea Righi <arighi@develer.com>
>> > ---
>> >  fs/fuse/file.c      |    5 +++
>> >  fs/nfs/write.c      |    4 ++
>> >  fs/nilfs2/segment.c |   10 +++++-
>> >  mm/filemap.c        |    1 +
>> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>> >  mm/rmap.c           |    4 +-
>> >  mm/truncate.c       |    2 +
>> >  7 files changed, 76 insertions(+), 34 deletions(-)
>> >
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index a9f5e13..dbbdd53 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -11,6 +11,7 @@
>> >  #include <linux/pagemap.h>
>> >  #include <linux/slab.h>
>> >  #include <linux/kernel.h>
>> > +#include <linux/memcontrol.h>
>> >  #include <linux/sched.h>
>> >  #include <linux/module.h>
>> >
>> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>> >
>> >        list_del(&req->writepages_entry);
>> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(req->pages[0],
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>> >        bdi_writeout_inc(bdi);
>> >        wake_up(&fi->page_waitq);
>> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>> >        req->inode = inode;
>> >
>> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(tmp_page,
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>> >        end_page_writeback(page);
>> >
>> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> > index b753242..7316f7a 100644
>> > --- a/fs/nfs/write.c
>> > +++ b/fs/nfs/write.c
>> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>> >                        req->wb_index,
>> >                        NFS_PAGE_TAG_COMMIT);
>> >        spin_unlock(&inode->i_lock);
>> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>> >        struct page *page = req->wb_page;
>> >
>> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
>> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >                return 1;
>> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>> >                req = nfs_list_entry(head->next);
>> >                nfs_list_remove_request(req);
>> >                nfs_mark_request_commit(req);
>> > +               mem_cgroup_update_stat(req->wb_page,
>> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>> >                                BDI_UNSTABLE);
>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> > index ada2f1b..aef6d13 100644
>> > --- a/fs/nilfs2/segment.c
>> > +++ b/fs/nilfs2/segment.c
>> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>> >        kunmap_atomic(kaddr, KM_USER0);
>> >
>> > -       if (!TestSetPageWriteback(clone_page))
>> > +       if (!TestSetPageWriteback(clone_page)) {
>> > +               mem_cgroup_update_stat(clone_page,
>>
>> s/clone_page/page/
>
> mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> inc_zone_page_state()?

Sorry, I've commented wrong hunk. It's for the next one.

>>
>> And #include <linux/memcontrol.h> is missed.
>
> OK.
>
> I'll apply your fixes and post a new version.
>
> Thanks for reviewing,
> -Andrea
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 11:09         ` Kirill A. Shutemov
  0 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 11:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
>> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
>> > Apply the cgroup dirty pages accounting and limiting infrastructure to
>> > the opportune kernel functions.
>> >
>> > Signed-off-by: Andrea Righi <arighi@develer.com>
>> > ---
>> >  fs/fuse/file.c      |    5 +++
>> >  fs/nfs/write.c      |    4 ++
>> >  fs/nilfs2/segment.c |   10 +++++-
>> >  mm/filemap.c        |    1 +
>> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>> >  mm/rmap.c           |    4 +-
>> >  mm/truncate.c       |    2 +
>> >  7 files changed, 76 insertions(+), 34 deletions(-)
>> >
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index a9f5e13..dbbdd53 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -11,6 +11,7 @@
>> >  #include <linux/pagemap.h>
>> >  #include <linux/slab.h>
>> >  #include <linux/kernel.h>
>> > +#include <linux/memcontrol.h>
>> >  #include <linux/sched.h>
>> >  #include <linux/module.h>
>> >
>> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>> >
>> >        list_del(&req->writepages_entry);
>> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(req->pages[0],
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>> >        bdi_writeout_inc(bdi);
>> >        wake_up(&fi->page_waitq);
>> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>> >        req->inode = inode;
>> >
>> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> > +       mem_cgroup_update_stat(tmp_page,
>> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>> >        end_page_writeback(page);
>> >
>> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> > index b753242..7316f7a 100644
>> > --- a/fs/nfs/write.c
>> > +++ b/fs/nfs/write.c
>> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>> >                        req->wb_index,
>> >                        NFS_PAGE_TAG_COMMIT);
>> >        spin_unlock(&inode->i_lock);
>> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>> >        struct page *page = req->wb_page;
>> >
>> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
>> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>> >                return 1;
>> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>> >                req = nfs_list_entry(head->next);
>> >                nfs_list_remove_request(req);
>> >                nfs_mark_request_commit(req);
>> > +               mem_cgroup_update_stat(req->wb_page,
>> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>> >                                BDI_UNSTABLE);
>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> > index ada2f1b..aef6d13 100644
>> > --- a/fs/nilfs2/segment.c
>> > +++ b/fs/nilfs2/segment.c
>> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>> >        kunmap_atomic(kaddr, KM_USER0);
>> >
>> > -       if (!TestSetPageWriteback(clone_page))
>> > +       if (!TestSetPageWriteback(clone_page)) {
>> > +               mem_cgroup_update_stat(clone_page,
>>
>> s/clone_page/page/
>
> mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> inc_zone_page_state()?

Sorry, I've commented wrong hunk. It's for the next one.

>>
>> And #include <linux/memcontrol.h> is missed.
>
> OK.
>
> I'll apply your fixes and post a new version.
>
> Thanks for reviewing,
> -Andrea
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]         ` <cc557aab1003020309y37587110i685d0d968bfba9f4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-02 11:34           ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 01:09:24PM +0200, Kirill A. Shutemov wrote:
> On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> >> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> >> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> >> > the opportune kernel functions.
> >> >
> >> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> >> > ---
> >> >  fs/fuse/file.c      |    5 +++
> >> >  fs/nfs/write.c      |    4 ++
> >> >  fs/nilfs2/segment.c |   10 +++++-
> >> >  mm/filemap.c        |    1 +
> >> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >> >  mm/rmap.c           |    4 +-
> >> >  mm/truncate.c       |    2 +
> >> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >> >
> >> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >> > index a9f5e13..dbbdd53 100644
> >> > --- a/fs/fuse/file.c
> >> > +++ b/fs/fuse/file.c
> >> > @@ -11,6 +11,7 @@
> >> >  #include <linux/pagemap.h>
> >> >  #include <linux/slab.h>
> >> >  #include <linux/kernel.h>
> >> > +#include <linux/memcontrol.h>
> >> >  #include <linux/sched.h>
> >> >  #include <linux/module.h>
> >> >
> >> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >> >
> >> >        list_del(&req->writepages_entry);
> >> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(req->pages[0],
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >> >        bdi_writeout_inc(bdi);
> >> >        wake_up(&fi->page_waitq);
> >> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >> >        req->inode = inode;
> >> >
> >> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(tmp_page,
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >> >        end_page_writeback(page);
> >> >
> >> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> >> > index b753242..7316f7a 100644
> >> > --- a/fs/nfs/write.c
> >> > +++ b/fs/nfs/write.c
> >> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >> >                        req->wb_index,
> >> >                        NFS_PAGE_TAG_COMMIT);
> >> >        spin_unlock(&inode->i_lock);
> >> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> >> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >> >        struct page *page = req->wb_page;
> >> >
> >> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> >> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >                return 1;
> >> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >> >                req = nfs_list_entry(head->next);
> >> >                nfs_list_remove_request(req);
> >> >                nfs_mark_request_commit(req);
> >> > +               mem_cgroup_update_stat(req->wb_page,
> >> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >> >                                BDI_UNSTABLE);
> >> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >> > index ada2f1b..aef6d13 100644
> >> > --- a/fs/nilfs2/segment.c
> >> > +++ b/fs/nilfs2/segment.c
> >> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >> >        kunmap_atomic(kaddr, KM_USER0);
> >> >
> >> > -       if (!TestSetPageWriteback(clone_page))
> >> > +       if (!TestSetPageWriteback(clone_page)) {
> >> > +               mem_cgroup_update_stat(clone_page,
> >>
> >> s/clone_page/page/
> >
> > mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> > inc_zone_page_state()?
> 
> Sorry, I've commented wrong hunk. It's for the next one.

Yes. Good catch! Will fix in the next version.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 11:09         ` Kirill A. Shutemov
@ 2010-03-02 11:34           ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 01:09:24PM +0200, Kirill A. Shutemov wrote:
> On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi@develer.com> wrote:
> > On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> >> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> >> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> >> > the opportune kernel functions.
> >> >
> >> > Signed-off-by: Andrea Righi <arighi@develer.com>
> >> > ---
> >> >  fs/fuse/file.c      |    5 +++
> >> >  fs/nfs/write.c      |    4 ++
> >> >  fs/nilfs2/segment.c |   10 +++++-
> >> >  mm/filemap.c        |    1 +
> >> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >> >  mm/rmap.c           |    4 +-
> >> >  mm/truncate.c       |    2 +
> >> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >> >
> >> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >> > index a9f5e13..dbbdd53 100644
> >> > --- a/fs/fuse/file.c
> >> > +++ b/fs/fuse/file.c
> >> > @@ -11,6 +11,7 @@
> >> >  #include <linux/pagemap.h>
> >> >  #include <linux/slab.h>
> >> >  #include <linux/kernel.h>
> >> > +#include <linux/memcontrol.h>
> >> >  #include <linux/sched.h>
> >> >  #include <linux/module.h>
> >> >
> >> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >> >
> >> >        list_del(&req->writepages_entry);
> >> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(req->pages[0],
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >> >        bdi_writeout_inc(bdi);
> >> >        wake_up(&fi->page_waitq);
> >> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >> >        req->inode = inode;
> >> >
> >> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(tmp_page,
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >> >        end_page_writeback(page);
> >> >
> >> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> >> > index b753242..7316f7a 100644
> >> > --- a/fs/nfs/write.c
> >> > +++ b/fs/nfs/write.c
> >> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >> >                        req->wb_index,
> >> >                        NFS_PAGE_TAG_COMMIT);
> >> >        spin_unlock(&inode->i_lock);
> >> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> >> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >> >        struct page *page = req->wb_page;
> >> >
> >> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> >> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >                return 1;
> >> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >> >                req = nfs_list_entry(head->next);
> >> >                nfs_list_remove_request(req);
> >> >                nfs_mark_request_commit(req);
> >> > +               mem_cgroup_update_stat(req->wb_page,
> >> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >> >                                BDI_UNSTABLE);
> >> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >> > index ada2f1b..aef6d13 100644
> >> > --- a/fs/nilfs2/segment.c
> >> > +++ b/fs/nilfs2/segment.c
> >> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >> >        kunmap_atomic(kaddr, KM_USER0);
> >> >
> >> > -       if (!TestSetPageWriteback(clone_page))
> >> > +       if (!TestSetPageWriteback(clone_page)) {
> >> > +               mem_cgroup_update_stat(clone_page,
> >>
> >> s/clone_page/page/
> >
> > mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> > inc_zone_page_state()?
> 
> Sorry, I've commented wrong hunk. It's for the next one.

Yes. Good catch! Will fix in the next version.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 11:34           ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 11:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 02, 2010 at 01:09:24PM +0200, Kirill A. Shutemov wrote:
> On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi <arighi@develer.com> wrote:
> > On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
> >> On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi <arighi@develer.com> wrote:
> >> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> >> > the opportune kernel functions.
> >> >
> >> > Signed-off-by: Andrea Righi <arighi@develer.com>
> >> > ---
> >> >  fs/fuse/file.c      |    5 +++
> >> >  fs/nfs/write.c      |    4 ++
> >> >  fs/nilfs2/segment.c |   10 +++++-
> >> >  mm/filemap.c        |    1 +
> >> >  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
> >> >  mm/rmap.c           |    4 +-
> >> >  mm/truncate.c       |    2 +
> >> >  7 files changed, 76 insertions(+), 34 deletions(-)
> >> >
> >> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >> > index a9f5e13..dbbdd53 100644
> >> > --- a/fs/fuse/file.c
> >> > +++ b/fs/fuse/file.c
> >> > @@ -11,6 +11,7 @@
> >> >  #include <linux/pagemap.h>
> >> >  #include <linux/slab.h>
> >> >  #include <linux/kernel.h>
> >> > +#include <linux/memcontrol.h>
> >> >  #include <linux/sched.h>
> >> >  #include <linux/module.h>
> >> >
> >> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> >> >
> >> >        list_del(&req->writepages_entry);
> >> >        dec_bdi_stat(bdi, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(req->pages[0],
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >> >        dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >> >        bdi_writeout_inc(bdi);
> >> >        wake_up(&fi->page_waitq);
> >> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >> >        req->inode = inode;
> >> >
> >> >        inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> >> > +       mem_cgroup_update_stat(tmp_page,
> >> > +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >> >        inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >> >        end_page_writeback(page);
> >> >
> >> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> >> > index b753242..7316f7a 100644
> >> > --- a/fs/nfs/write.c
> >> > +++ b/fs/nfs/write.c
> >> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >> >                        req->wb_index,
> >> >                        NFS_PAGE_TAG_COMMIT);
> >> >        spin_unlock(&inode->i_lock);
> >> > +       mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >> >        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> >> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >> >        struct page *page = req->wb_page;
> >> >
> >> >        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> >> > +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >> >                return 1;
> >> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >> >                req = nfs_list_entry(head->next);
> >> >                nfs_list_remove_request(req);
> >> >                nfs_mark_request_commit(req);
> >> > +               mem_cgroup_update_stat(req->wb_page,
> >> > +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >> >                dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >> >                dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >> >                                BDI_UNSTABLE);
> >> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> >> > index ada2f1b..aef6d13 100644
> >> > --- a/fs/nilfs2/segment.c
> >> > +++ b/fs/nilfs2/segment.c
> >> > @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >> >        } while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >> >        kunmap_atomic(kaddr, KM_USER0);
> >> >
> >> > -       if (!TestSetPageWriteback(clone_page))
> >> > +       if (!TestSetPageWriteback(clone_page)) {
> >> > +               mem_cgroup_update_stat(clone_page,
> >>
> >> s/clone_page/page/
> >
> > mmh... shouldn't we use the same page used by TestSetPageWriteback() and
> > inc_zone_page_state()?
> 
> Sorry, I've commented wrong hunk. It's for the next one.

Yes. Good catch! Will fix in the next version.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267478620-5276-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-02  0:20     ` KAMEZAWA Hiroyuki
  2010-03-02 10:04     ` Kirill A. Shutemov
@ 2010-03-02 13:02     ` Balbir Singh
  2010-03-02 18:08     ` Greg Thelen
  3 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

* Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> [2010-03-01 22:23:39]:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>

Look good, but yet to be tested from my side.


> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
> 
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> 
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
> 
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +

Docstyle comments for each function would be appreciated

>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
> 
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);

Good to see you make generic use of this function

> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
> 
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
> 
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
> 
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
> 
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
> 
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> 
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> 
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
> 
>  	unsigned int	swappiness;
> 
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +

Could you mention what protects this field, is it the reclaim_lock?
BTW, is unsigned long sufficient to represent dirty_param(s)?

>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> 
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
> 
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);

Do we need a spinlock if we protect it using RCU? Is precise data very
important?

> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :

Shouldn't you do a res_counter_read_u64(...) > 0 for readability?
What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
> 
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> 
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
> 
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
> 
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
> 
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
> 
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.a

The do while loop is subtle, can we add a validate check,share it with
the write routine and retry if validation fails?

> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 13:02     ` Balbir Singh
  -1 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-01 22:23:39]:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

Look good, but yet to be tested from my side.


> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
> 
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> 
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
> 
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +

Docstyle comments for each function would be appreciated

>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
> 
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);

Good to see you make generic use of this function

> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
> 
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
> 
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
> 
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
> 
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
> 
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> 
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> 
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
> 
>  	unsigned int	swappiness;
> 
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +

Could you mention what protects this field, is it the reclaim_lock?
BTW, is unsigned long sufficient to represent dirty_param(s)?

>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> 
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
> 
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);

Do we need a spinlock if we protect it using RCU? Is precise data very
important?

> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :

Shouldn't you do a res_counter_read_u64(...) > 0 for readability?
What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
> 
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> 
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
> 
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
> 
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
> 
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
> 
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.a

The do while loop is subtle, can we add a validate check,share it with
the write routine and retry if validation fails?

> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 13:02     ` Balbir Singh
  0 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-01 22:23:39]:

> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

Look good, but yet to be tested from my side.


> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
> 
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
> 
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +					used by soft limit implementation */
> +	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +					used by threshold implementation */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
> 
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +

Docstyle comments for each function would be appreciated

>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
> 
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);

Good to see you make generic use of this function

> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
> 
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
> 
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
> 
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +	return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +	return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
> 
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
> 
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -					used by soft limit implementation */
> -	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -					used by threshold implementation */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> 
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> 
> +enum mem_cgroup_dirty_param {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +	MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
> 
>  	unsigned int	swappiness;
> 
> +	/* control memory cgroup dirty pages */
> +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +

Could you mention what protects this field, is it the reclaim_lock?
BTW, is unsigned long sufficient to represent dirty_param(s)?

>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> 
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
> 
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +			enum mem_cgroup_dirty_param idx)
> +{
> +	unsigned long ret;
> +
> +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +	spin_lock(&memcg->reclaim_param_lock);
> +	ret = memcg->dirty_param[idx];
> +	spin_unlock(&memcg->reclaim_param_lock);

Do we need a spinlock if we protect it using RCU? Is precise data very
important?

> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = vm_dirty_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = vm_dirty_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +	struct mem_cgroup *memcg;
> +	long ret = dirty_background_ratio;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long ret = dirty_background_bytes;
> +
> +	if (mem_cgroup_disabled())
> +		return ret;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (likely(memcg))
> +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	return do_swap_account ?
> +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :

Shouldn't you do a res_counter_read_u64(...) > 0 for readability?
What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

> +			nr_swap_pages > 0;
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		ret = 0;
> +		WARN_ON_ONCE(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return -ENOMEM;
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -ENOMEM;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
> 
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update memory cgroup statistics.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> 
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> @@ -1286,7 +1443,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> +	VM_BUG_ON(idx >= MEM_CGROUP_STAT_NSTATS);
> +	__this_cpu_add(mem->stat->count[idx], val);
> 
>  done:
>  	unlock_page_cgroup(pc);
> @@ -3033,6 +3191,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3055,6 +3217,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3083,6 +3249,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
> 
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3467,6 +3641,50 @@ unlock:
>  	return ret;
>  }
> 
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	return get_dirty_param(memcg, type);
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
> +		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
> +		return -EINVAL;
> +
> +	spin_lock(&memcg->reclaim_param_lock);
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES] = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = val;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = 0;
> +		memcg->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = val;
> +		break;
> +	}
> +	spin_unlock(&memcg->reclaim_param_lock);
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3518,6 +3736,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3725,6 +3967,19 @@ static int mem_cgroup_soft_limit_tree_init(void)
>  	return 0;
>  }
> 
> +/*
> + * NOTE: called only with &src->reclaim_param_lock held from
> + * mem_cgroup_create().
> + */
> +static inline void
> +copy_dirty_params(struct mem_cgroup *dst, struct mem_cgroup *src)
> +{
> +	int i;
> +
> +	for (i = 0; i < MEM_CGROUP_DIRTY_NPARAMS; i++)
> +		dst->dirty_param[i] = src->dirty_param[i];
> +}
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +
> +		spin_lock(&parent->reclaim_param_lock);
> +		copy_dirty_params(mem, parent);
> +		spin_unlock(&parent->reclaim_param_lock);
> +	} else {
> +		/*
> +		 * XXX: should we need a lock here? we could switch from
> +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> +		 * reading them atomically. The same for dirty_background_ratio
> +		 * and dirty_background_bytes.
> +		 *
> +		 * For now, try to read them speculatively and retry if a
> +		 * "conflict" is detected.a

The do while loop is subtle, can we add a validate check,share it with
the write routine and retry if validation fails?

> +		 */
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> +						vm_dirty_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> +						vm_dirty_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> +		do {
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> +						dirty_background_ratio;
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> +						dirty_background_bytes;
> +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
                       ` (2 preceding siblings ...)
  2010-03-02 10:11     ` Kirill A. Shutemov
@ 2010-03-02 13:47     ` Balbir Singh
  2010-03-02 13:48     ` Peter Zijlstra
  2010-03-03  2:12     ` Daisuke Nishimura
  5 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

* Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> [2010-03-01 22:23:40]:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c

Don't need memcontrol.h to be included here?

Looks OK to me overall, but there might be objection using the
mem_cgroup_* naming convention, but I don't mind it very much :)

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 13:47     ` Balbir Singh
  -1 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-01 22:23:40]:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c

Don't need memcontrol.h to be included here?

Looks OK to me overall, but there might be objection using the
mem_cgroup_* naming convention, but I don't mind it very much :)

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 13:47     ` Balbir Singh
  0 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-01 22:23:40]:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c

Don't need memcontrol.h to be included here?

Looks OK to me overall, but there might be objection using the
mem_cgroup_* naming convention, but I don't mind it very much :)

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
                       ` (3 preceding siblings ...)
  2010-03-02 13:47     ` Balbir Singh
@ 2010-03-02 13:48     ` Peter Zijlstra
  2010-03-03  2:12     ` Daisuke Nishimura
  5 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-02 13:48 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

So you don't think 0 is a valid max dirty amount?

> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)

And here you somehow return negative?

> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

zero not valid again

> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)

idem

> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);

??? why would a page_state be negative.. I see you return -ENOMEM on !
cgroup, but how can one specify no dirty limit with this compiled in?

> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);

Again..

>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

and again..

> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion

This is ugly and broken.. I thought you'd agreed to something like:

 if (mem_cgroup_has_dirty_limit(cgroup))
   use mem_cgroup numbers
 else
   use global numbers

That allows for a 0 dirty limit (which should work and basically makes
all io synchronous).

Also, I'd put each of those in a separate function, like:

unsigned long reclaimable_pages(cgroup)
{
  if (mem_cgroup_has_dirty_limit(cgroup))
    return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
  
  return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
}

Which raises another question, you should probably rebase on top of
Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 13:48     ` Peter Zijlstra
  -1 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-02 13:48 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

So you don't think 0 is a valid max dirty amount?

> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)

And here you somehow return negative?

> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

zero not valid again

> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)

idem

> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);

??? why would a page_state be negative.. I see you return -ENOMEM on !
cgroup, but how can one specify no dirty limit with this compiled in?

> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);

Again..

>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

and again..

> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion

This is ugly and broken.. I thought you'd agreed to something like:

 if (mem_cgroup_has_dirty_limit(cgroup))
   use mem_cgroup numbers
 else
   use global numbers

That allows for a 0 dirty limit (which should work and basically makes
all io synchronous).

Also, I'd put each of those in a separate function, like:

unsigned long reclaimable_pages(cgroup)
{
  if (mem_cgroup_has_dirty_limit(cgroup))
    return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
  
  return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
}

Which raises another question, you should probably rebase on top of
Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 13:48     ` Peter Zijlstra
  0 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-02 13:48 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

So you don't think 0 is a valid max dirty amount?

> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)

And here you somehow return negative?

> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)

zero not valid again

> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)

idem

> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);

??? why would a page_state be negative.. I see you return -ENOMEM on !
cgroup, but how can one specify no dirty limit with this compiled in?

> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);

Again..

>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);

and again..

> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion

This is ugly and broken.. I thought you'd agreed to something like:

 if (mem_cgroup_has_dirty_limit(cgroup))
   use mem_cgroup numbers
 else
   use global numbers

That allows for a 0 dirty limit (which should work and basically makes
all io synchronous).

Also, I'd put each of those in a separate function, like:

unsigned long reclaimable_pages(cgroup)
{
  if (mem_cgroup_has_dirty_limit(cgroup))
    return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
  
  return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
}

Which raises another question, you should probably rebase on top of
Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]         ` <20100302172316.b959b04c.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-03-02 13:50           ` Balbir Singh
  0 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2010-03-02 17:23:16]:

> On Tue, 2 Mar 2010 09:01:58 +0100
> Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > 
> > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > the opportune kernel functions.
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > 
> > > Seems nice.
> > > 
> > > Hmm. the last problem is moving account between memcg.
> > > 
> > > Right ?
> > 
> > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > still considering if it's correct to move dirty pages when a task is
> > migrated from a cgroup to another. Currently, dirty pages just remain in
> > the original cgroup and are flushed depending on the original cgroup
> > settings. That is not totally wrong... at least moving the dirty pages
> > between memcgs should be optional (move_charge_at_immigrate?).
> > 
> 
> My concern is 
>  - migration between memcg is already suppoted
>     - at task move
>     - at rmdir
> 
> Then, if you leave DIRTY_PAGE accounting to original cgroup,
> the new cgroup (migration target)'s Dirty page accounting may
> goes to be negative, or incorrect value. Please check FILE_MAPPED
> implementation in __mem_cgroup_move_account()
> 
> As
>        if (page_mapped(page) && !PageAnon(page)) {
>                 /* Update mapped_file data for mem_cgroup */
>                 preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 preempt_enable();
>         }
> then, FILE_MAPPED never goes negative.
>

Absolutely! I am not sure how complex dirty memory migration will be,
but one way of working around it would be to disable migration of
charges when the feature is enabled (dirty* is set in the memory
cgroup). We might need additional logic to allow that to happen. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02  8:23         ` KAMEZAWA Hiroyuki
@ 2010-03-02 13:50           ` Balbir Singh
  -1 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:

> On Tue, 2 Mar 2010 09:01:58 +0100
> Andrea Righi <arighi@develer.com> wrote:
> 
> > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > Andrea Righi <arighi@develer.com> wrote:
> > > 
> > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > the opportune kernel functions.
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > 
> > > Seems nice.
> > > 
> > > Hmm. the last problem is moving account between memcg.
> > > 
> > > Right ?
> > 
> > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > still considering if it's correct to move dirty pages when a task is
> > migrated from a cgroup to another. Currently, dirty pages just remain in
> > the original cgroup and are flushed depending on the original cgroup
> > settings. That is not totally wrong... at least moving the dirty pages
> > between memcgs should be optional (move_charge_at_immigrate?).
> > 
> 
> My concern is 
>  - migration between memcg is already suppoted
>     - at task move
>     - at rmdir
> 
> Then, if you leave DIRTY_PAGE accounting to original cgroup,
> the new cgroup (migration target)'s Dirty page accounting may
> goes to be negative, or incorrect value. Please check FILE_MAPPED
> implementation in __mem_cgroup_move_account()
> 
> As
>        if (page_mapped(page) && !PageAnon(page)) {
>                 /* Update mapped_file data for mem_cgroup */
>                 preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 preempt_enable();
>         }
> then, FILE_MAPPED never goes negative.
>

Absolutely! I am not sure how complex dirty memory migration will be,
but one way of working around it would be to disable migration of
charges when the feature is enabled (dirty* is set in the memory
cgroup). We might need additional logic to allow that to happen. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 13:50           ` Balbir Singh
  0 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 13:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:

> On Tue, 2 Mar 2010 09:01:58 +0100
> Andrea Righi <arighi@develer.com> wrote:
> 
> > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > Andrea Righi <arighi@develer.com> wrote:
> > > 
> > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > the opportune kernel functions.
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > 
> > > Seems nice.
> > > 
> > > Hmm. the last problem is moving account between memcg.
> > > 
> > > Right ?
> > 
> > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > still considering if it's correct to move dirty pages when a task is
> > migrated from a cgroup to another. Currently, dirty pages just remain in
> > the original cgroup and are flushed depending on the original cgroup
> > settings. That is not totally wrong... at least moving the dirty pages
> > between memcgs should be optional (move_charge_at_immigrate?).
> > 
> 
> My concern is 
>  - migration between memcg is already suppoted
>     - at task move
>     - at rmdir
> 
> Then, if you leave DIRTY_PAGE accounting to original cgroup,
> the new cgroup (migration target)'s Dirty page accounting may
> goes to be negative, or incorrect value. Please check FILE_MAPPED
> implementation in __mem_cgroup_move_account()
> 
> As
>        if (page_mapped(page) && !PageAnon(page)) {
>                 /* Update mapped_file data for mem_cgroup */
>                 preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 preempt_enable();
>         }
> then, FILE_MAPPED never goes negative.
>

Absolutely! I am not sure how complex dirty memory migration will be,
but one way of working around it would be to disable migration of
charges when the feature is enabled (dirty* is set in the memory
cgroup). We might need additional logic to allow that to happen. 

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]     ` <20100302134736.GG3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2010-03-02 13:56       ` Kirill A. Shutemov
  0 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 13:56 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

On Tue, Mar 2, 2010 at 3:47 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-01 22:23:40]:
>
>> Apply the cgroup dirty pages accounting and limiting infrastructure to
>> the opportune kernel functions.
>>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  fs/fuse/file.c      |    5 +++
>>  fs/nfs/write.c      |    4 ++
>>  fs/nilfs2/segment.c |   10 +++++-
>>  mm/filemap.c        |    1 +
>>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>>  mm/rmap.c           |    4 +-
>>  mm/truncate.c       |    2 +
>>  7 files changed, 76 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index a9f5e13..dbbdd53 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/pagemap.h>
>>  #include <linux/slab.h>
>>  #include <linux/kernel.h>
>> +#include <linux/memcontrol.h>
>>  #include <linux/sched.h>
>>  #include <linux/module.h>
>>
>> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>>
>>       list_del(&req->writepages_entry);
>>       dec_bdi_stat(bdi, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(req->pages[0],
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>>       dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>>       bdi_writeout_inc(bdi);
>>       wake_up(&fi->page_waitq);
>> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>>       req->inode = inode;
>>
>>       inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(tmp_page,
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>>       inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>>       end_page_writeback(page);
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index b753242..7316f7a 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>
> Don't need memcontrol.h to be included here?

It's included in <linux/swap.h>

> Looks OK to me overall, but there might be objection using the
> mem_cgroup_* naming convention, but I don't mind it very much :)
>
> --
>        Three Cheers,
>        Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:47     ` Balbir Singh
@ 2010-03-02 13:56       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 13:56 UTC (permalink / raw)
  To: balbir
  Cc: Andrea Righi, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 2, 2010 at 3:47 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-01 22:23:40]:
>
>> Apply the cgroup dirty pages accounting and limiting infrastructure to
>> the opportune kernel functions.
>>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  fs/fuse/file.c      |    5 +++
>>  fs/nfs/write.c      |    4 ++
>>  fs/nilfs2/segment.c |   10 +++++-
>>  mm/filemap.c        |    1 +
>>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>>  mm/rmap.c           |    4 +-
>>  mm/truncate.c       |    2 +
>>  7 files changed, 76 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index a9f5e13..dbbdd53 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/pagemap.h>
>>  #include <linux/slab.h>
>>  #include <linux/kernel.h>
>> +#include <linux/memcontrol.h>
>>  #include <linux/sched.h>
>>  #include <linux/module.h>
>>
>> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>>
>>       list_del(&req->writepages_entry);
>>       dec_bdi_stat(bdi, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(req->pages[0],
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>>       dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>>       bdi_writeout_inc(bdi);
>>       wake_up(&fi->page_waitq);
>> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>>       req->inode = inode;
>>
>>       inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(tmp_page,
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>>       inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>>       end_page_writeback(page);
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index b753242..7316f7a 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>
> Don't need memcontrol.h to be included here?

It's included in <linux/swap.h>

> Looks OK to me overall, but there might be objection using the
> mem_cgroup_* naming convention, but I don't mind it very much :)
>
> --
>        Three Cheers,
>        Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 13:56       ` Kirill A. Shutemov
  0 siblings, 0 replies; 140+ messages in thread
From: Kirill A. Shutemov @ 2010-03-02 13:56 UTC (permalink / raw)
  To: balbir
  Cc: Andrea Righi, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Tue, Mar 2, 2010 at 3:47 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-01 22:23:40]:
>
>> Apply the cgroup dirty pages accounting and limiting infrastructure to
>> the opportune kernel functions.
>>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  fs/fuse/file.c      |    5 +++
>>  fs/nfs/write.c      |    4 ++
>>  fs/nilfs2/segment.c |   10 +++++-
>>  mm/filemap.c        |    1 +
>>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>>  mm/rmap.c           |    4 +-
>>  mm/truncate.c       |    2 +
>>  7 files changed, 76 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index a9f5e13..dbbdd53 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/pagemap.h>
>>  #include <linux/slab.h>
>>  #include <linux/kernel.h>
>> +#include <linux/memcontrol.h>
>>  #include <linux/sched.h>
>>  #include <linux/module.h>
>>
>> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>>
>>       list_del(&req->writepages_entry);
>>       dec_bdi_stat(bdi, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(req->pages[0],
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>>       dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>>       bdi_writeout_inc(bdi);
>>       wake_up(&fi->page_waitq);
>> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>>       req->inode = inode;
>>
>>       inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
>> +     mem_cgroup_update_stat(tmp_page,
>> +                     MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>>       inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>>       end_page_writeback(page);
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index b753242..7316f7a 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>
> Don't need memcontrol.h to be included here?

It's included in <linux/swap.h>

> Looks OK to me overall, but there might be objection using the
> mem_cgroup_* naming convention, but I don't mind it very much :)
>
> --
>        Three Cheers,
>        Balbir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 22:18       ` Andrea Righi
  (?)
  (?)
@ 2010-03-02 15:05       ` Vivek Goyal
  -1 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 15:05 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >                   */
> > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > >  
> > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > -                        	break;
> > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +
> > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > +		if (dirty < 0)
> > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > +				global_page_state(NR_WRITEBACK);
> > 
> > dirty is unsigned long. As mentioned last time, above will never be true?
> > In general these patches look ok to me. I will do some testing with these.
> 
> Re-introduced the same bug. My bad. :(
> 
> The value returned from mem_cgroup_page_stat() can be negative, i.e.
> when memory cgroup is disabled. We could simply use a long for dirty,
> the unit is in # of pages so s64 should be enough. Or cast dirty to long
> only for the check (see below).
> 
> Thanks!
> -Andrea
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  mm/page-writeback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d83f41c..dbee976 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  
>  
>  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -		if (dirty < 0)
> +		if ((long)dirty < 0)

This will also be problematic as on 32bit systems, your uppper limit of
dirty memory will be 2G?

I guess, I will prefer one of the two.

- return the error code from function and pass a pointer to store stats
  in as function argument.

- Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
  per cgroup dirty control is enabled, then use per cgroup stats. In that
  case you don't have to return negative values.

  Only tricky part will be careful accouting so that none of the stats go
  negative in corner cases of migration etc.

Thanks
Vivek

>  			dirty = global_page_state(NR_UNSTABLE_NFS) +
>  				global_page_state(NR_WRITEBACK);
>  		if (dirty <= dirty_thresh)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 22:18       ` Andrea Righi
@ 2010-03-02 15:05         ` Vivek Goyal
  -1 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 15:05 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >                   */
> > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > >  
> > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > -                        	break;
> > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +
> > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > +		if (dirty < 0)
> > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > +				global_page_state(NR_WRITEBACK);
> > 
> > dirty is unsigned long. As mentioned last time, above will never be true?
> > In general these patches look ok to me. I will do some testing with these.
> 
> Re-introduced the same bug. My bad. :(
> 
> The value returned from mem_cgroup_page_stat() can be negative, i.e.
> when memory cgroup is disabled. We could simply use a long for dirty,
> the unit is in # of pages so s64 should be enough. Or cast dirty to long
> only for the check (see below).
> 
> Thanks!
> -Andrea
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  mm/page-writeback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d83f41c..dbee976 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  
>  
>  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -		if (dirty < 0)
> +		if ((long)dirty < 0)

This will also be problematic as on 32bit systems, your uppper limit of
dirty memory will be 2G?

I guess, I will prefer one of the two.

- return the error code from function and pass a pointer to store stats
  in as function argument.

- Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
  per cgroup dirty control is enabled, then use per cgroup stats. In that
  case you don't have to return negative values.

  Only tricky part will be careful accouting so that none of the stats go
  negative in corner cases of migration etc.

Thanks
Vivek

>  			dirty = global_page_state(NR_UNSTABLE_NFS) +
>  				global_page_state(NR_WRITEBACK);
>  		if (dirty <= dirty_thresh)

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 15:05         ` Vivek Goyal
  0 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 15:05 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >                   */
> > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > >  
> > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > -                        	break;
> > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > +
> > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > +		if (dirty < 0)
> > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > +				global_page_state(NR_WRITEBACK);
> > 
> > dirty is unsigned long. As mentioned last time, above will never be true?
> > In general these patches look ok to me. I will do some testing with these.
> 
> Re-introduced the same bug. My bad. :(
> 
> The value returned from mem_cgroup_page_stat() can be negative, i.e.
> when memory cgroup is disabled. We could simply use a long for dirty,
> the unit is in # of pages so s64 should be enough. Or cast dirty to long
> only for the check (see below).
> 
> Thanks!
> -Andrea
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  mm/page-writeback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d83f41c..dbee976 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  
>  
>  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -		if (dirty < 0)
> +		if ((long)dirty < 0)

This will also be problematic as on 32bit systems, your uppper limit of
dirty memory will be 2G?

I guess, I will prefer one of the two.

- return the error code from function and pass a pointer to store stats
  in as function argument.

- Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
  per cgroup dirty control is enabled, then use per cgroup stats. In that
  case you don't have to return negative values.

  Only tricky part will be careful accouting so that none of the stats go
  negative in corner cases of migration etc.

Thanks
Vivek

>  			dirty = global_page_state(NR_UNSTABLE_NFS) +
>  				global_page_state(NR_WRITEBACK);
>  		if (dirty <= dirty_thresh)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
  (?)
@ 2010-03-02 15:26     ` Balbir Singh
  -1 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton

* Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2010-03-02 14:48:56]:

> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers
> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).
> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
>

I agree, I should have been more specific about the naming convention,
this is what I meant - along these lines as we do with
zone_nr_lru_pages(), etc.
 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
@ 2010-03-02 15:26       ` Balbir Singh
  -1 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

* Peter Zijlstra <peterz@infradead.org> [2010-03-02 14:48:56]:

> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers
> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).
> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
>

I agree, I should have been more specific about the naming convention,
this is what I meant - along these lines as we do with
zone_nr_lru_pages(), etc.
 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 15:26       ` Balbir Singh
  0 siblings, 0 replies; 140+ messages in thread
From: Balbir Singh @ 2010-03-02 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

* Peter Zijlstra <peterz@infradead.org> [2010-03-02 14:48:56]:

> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers
> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).
> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
>

I agree, I should have been more specific about the naming convention,
this is what I meant - along these lines as we do with
zone_nr_lru_pages(), etc.
 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  (?)
@ 2010-03-02 15:49     ` Trond Myklebust
  -1 siblings, 0 replies; 140+ messages in thread
From: Trond Myklebust @ 2010-03-02 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Daisuke Nishimura,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Balbir Singh

On Tue, 2010-03-02 at 14:48 +0100, Peter Zijlstra wrote: 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

I'm dropping those patches for now. The main writeback change wasn't too
favourably received by the linux-mm community so I've implemented an
alternative that only changes the NFS layer, and doesn't depend on the
DIRTY+UNSTABLE split.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
@ 2010-03-02 15:49       ` Trond Myklebust
  -1 siblings, 0 replies; 140+ messages in thread
From: Trond Myklebust @ 2010-03-02 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm

On Tue, 2010-03-02 at 14:48 +0100, Peter Zijlstra wrote: 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

I'm dropping those patches for now. The main writeback change wasn't too
favourably received by the linux-mm community so I've implemented an
alternative that only changes the NFS layer, and doesn't depend on the
DIRTY+UNSTABLE split.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 15:49       ` Trond Myklebust
  0 siblings, 0 replies; 140+ messages in thread
From: Trond Myklebust @ 2010-03-02 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm

On Tue, 2010-03-02 at 14:48 +0100, Peter Zijlstra wrote: 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }
> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.
> 

I'm dropping those patches for now. The main writeback change wasn't too
favourably received by the linux-mm community so I've implemented an
alternative that only changes the NFS layer, and doesn't depend on the
DIRTY+UNSTABLE split.

Cheers
  Trond

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267478620-5276-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
                       ` (2 preceding siblings ...)
  2010-03-02 13:02     ` Balbir Singh
@ 2010-03-02 18:08     ` Greg Thelen
  3 siblings, 0 replies; 140+ messages in thread
From: Greg Thelen @ 2010-03-02 18:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

Comments below.  Yet to be tested on my end, but I will test it.

On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +

> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}

Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
should we refactor the majority of them into a single routine?

--
Greg

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting  infrastructure
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-02 18:08     ` Greg Thelen
  -1 siblings, 0 replies; 140+ messages in thread
From: Greg Thelen @ 2010-03-02 18:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

Comments below.  Yet to be tested on my end, but I will test it.

On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +

> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}

Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
should we refactor the majority of them into a single routine?

--
Greg

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 18:08     ` Greg Thelen
  0 siblings, 0 replies; 140+ messages in thread
From: Greg Thelen @ 2010-03-02 18:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

Comments below.  Yet to be tested on my end, but I will test it.

On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   77 ++++++++++-
>  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 384 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc88b2e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,50 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> +                                       used by soft limit implementation */
> +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> +                                       used by threshold implementation */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern long mem_cgroup_dirty_ratio(void);
> +extern unsigned long mem_cgroup_dirty_bytes(void);
> +extern long mem_cgroup_dirty_background_ratio(void);
> +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> +
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline long mem_cgroup_dirty_ratio(void)
> +{
> +       return vm_dirty_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       return vm_dirty_bytes;
> +}
> +
> +static inline long mem_cgroup_dirty_background_ratio(void)
> +{
> +       return dirty_background_ratio;
> +}
> +
> +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       return dirty_background_bytes;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOMEM;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a443c30..e74cf66 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  #define THRESHOLDS_EVENTS_THRESH (100)
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> -                                       used by soft limit implementation */
> -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> -                                       used by threshold implementation */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
>  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>
> +enum mem_cgroup_dirty_param {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +
> +       MEM_CGROUP_DIRTY_NPARAMS,
> +};
> +
>  /*
>  * The memory controller data structure. The memory controller controls both
>  * page cache and RSS per cgroup. We would eventually like to provide
> @@ -205,6 +199,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> +                       enum mem_cgroup_dirty_param idx)
> +{
> +       unsigned long ret;
> +
> +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> +       spin_lock(&memcg->reclaim_param_lock);
> +       ret = memcg->dirty_param[idx];
> +       spin_unlock(&memcg->reclaim_param_lock);
> +
> +       return ret;
> +}
> +

> +long mem_cgroup_dirty_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = vm_dirty_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = vm_dirty_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +long mem_cgroup_dirty_background_ratio(void)
> +{
> +       struct mem_cgroup *memcg;
> +       long ret = dirty_background_ratio;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +unsigned long mem_cgroup_dirty_background_bytes(void)
> +{
> +       struct mem_cgroup *memcg;
> +       unsigned long ret = dirty_background_bytes;
> +
> +       if (mem_cgroup_disabled())
> +               return ret;
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (likely(memcg))
> +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}

Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
should we refactor the majority of them into a single routine?

--
Greg

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]     ` <20100302130223.GF3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2010-03-02 21:50       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 21:50 UTC (permalink / raw)
  To: Balbir Singh
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

On Tue, Mar 02, 2010 at 06:32:24PM +0530, Balbir Singh wrote:

[snip]

> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> 
> Docstyle comments for each function would be appreciated

OK.

> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> > 
> >  	unsigned int	swappiness;
> > 
> > +	/* control memory cgroup dirty pages */
> > +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> 
> Could you mention what protects this field, is it the reclaim_lock?

Yes, it is.

Actually, we could avoid the lock completely for dirty_param[], using a
validation routine to check for incoherencies after any read with
get_dirty_param(), and retry if the validation fails. In practice, the
same approach we're using to read global vm_dirty_ratio, vm_dirty_bytes,
etc...

Considering that those values are rarely written and read often we can
protect them in a RCU way.


> BTW, is unsigned long sufficient to represent dirty_param(s)?

Yes, I think. It's the same type used for the equivalent global values.

> 
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> > 
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  	return swappiness;
> >  }
> > 
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +			enum mem_cgroup_dirty_param idx)
> > +{
> > +	unsigned long ret;
> > +
> > +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +	spin_lock(&memcg->reclaim_param_lock);
> > +	ret = memcg->dirty_param[idx];
> > +	spin_unlock(&memcg->reclaim_param_lock);
> 
> Do we need a spinlock if we protect it using RCU? Is precise data very
> important?

See above.

> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned long ret = dirty_background_bytes;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return ret;
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
> > +	if (likely(memcg))
> > +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> > +{
> > +	return do_swap_account ?
> > +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> 
> Shouldn't you do a res_counter_read_u64(...) > 0 for readability?

OK.

> What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

OK, we should also check memcg->memsw_is_minimum.

> >  static struct cgroup_subsys_state * __ref
> >  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  {
> > @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	mem->last_scanned_child = 0;
> >  	spin_lock_init(&mem->reclaim_param_lock);
> > 
> > -	if (parent)
> > +	if (parent) {
> >  		mem->swappiness = get_swappiness(parent);
> > +
> > +		spin_lock(&parent->reclaim_param_lock);
> > +		copy_dirty_params(mem, parent);
> > +		spin_unlock(&parent->reclaim_param_lock);
> > +	} else {
> > +		/*
> > +		 * XXX: should we need a lock here? we could switch from
> > +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> > +		 * reading them atomically. The same for dirty_background_ratio
> > +		 * and dirty_background_bytes.
> > +		 *
> > +		 * For now, try to read them speculatively and retry if a
> > +		 * "conflict" is detected.a
> 
> The do while loop is subtle, can we add a validate check,share it with
> the write routine and retry if validation fails?

Agreed.

> 
> > +		 */
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> > +						vm_dirty_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> > +						vm_dirty_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> > +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> > +						dirty_background_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> > +						dirty_background_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> > +	}
> >  	atomic_set(&mem->refcnt, 1);
> >  	mem->move_charge_at_immigrate = 0;
> >  	mutex_init(&mem->thresholds_lock);

Many thanks for reviewing,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-02 13:02     ` Balbir Singh
@ 2010-03-02 21:50       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 21:50 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 06:32:24PM +0530, Balbir Singh wrote:

[snip]

> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> 
> Docstyle comments for each function would be appreciated

OK.

> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> > 
> >  	unsigned int	swappiness;
> > 
> > +	/* control memory cgroup dirty pages */
> > +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> 
> Could you mention what protects this field, is it the reclaim_lock?

Yes, it is.

Actually, we could avoid the lock completely for dirty_param[], using a
validation routine to check for incoherencies after any read with
get_dirty_param(), and retry if the validation fails. In practice, the
same approach we're using to read global vm_dirty_ratio, vm_dirty_bytes,
etc...

Considering that those values are rarely written and read often we can
protect them in a RCU way.


> BTW, is unsigned long sufficient to represent dirty_param(s)?

Yes, I think. It's the same type used for the equivalent global values.

> 
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> > 
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  	return swappiness;
> >  }
> > 
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +			enum mem_cgroup_dirty_param idx)
> > +{
> > +	unsigned long ret;
> > +
> > +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +	spin_lock(&memcg->reclaim_param_lock);
> > +	ret = memcg->dirty_param[idx];
> > +	spin_unlock(&memcg->reclaim_param_lock);
> 
> Do we need a spinlock if we protect it using RCU? Is precise data very
> important?

See above.

> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned long ret = dirty_background_bytes;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return ret;
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
> > +	if (likely(memcg))
> > +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> > +{
> > +	return do_swap_account ?
> > +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> 
> Shouldn't you do a res_counter_read_u64(...) > 0 for readability?

OK.

> What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

OK, we should also check memcg->memsw_is_minimum.

> >  static struct cgroup_subsys_state * __ref
> >  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  {
> > @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	mem->last_scanned_child = 0;
> >  	spin_lock_init(&mem->reclaim_param_lock);
> > 
> > -	if (parent)
> > +	if (parent) {
> >  		mem->swappiness = get_swappiness(parent);
> > +
> > +		spin_lock(&parent->reclaim_param_lock);
> > +		copy_dirty_params(mem, parent);
> > +		spin_unlock(&parent->reclaim_param_lock);
> > +	} else {
> > +		/*
> > +		 * XXX: should we need a lock here? we could switch from
> > +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> > +		 * reading them atomically. The same for dirty_background_ratio
> > +		 * and dirty_background_bytes.
> > +		 *
> > +		 * For now, try to read them speculatively and retry if a
> > +		 * "conflict" is detected.a
> 
> The do while loop is subtle, can we add a validate check,share it with
> the write routine and retry if validation fails?

Agreed.

> 
> > +		 */
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> > +						vm_dirty_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> > +						vm_dirty_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> > +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> > +						dirty_background_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> > +						dirty_background_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> > +	}
> >  	atomic_set(&mem->refcnt, 1);
> >  	mem->move_charge_at_immigrate = 0;
> >  	mutex_init(&mem->thresholds_lock);

Many thanks for reviewing,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 21:50       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 21:50 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 06:32:24PM +0530, Balbir Singh wrote:

[snip]

> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> 
> Docstyle comments for each function would be appreciated

OK.

> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> > 
> >  	unsigned int	swappiness;
> > 
> > +	/* control memory cgroup dirty pages */
> > +	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> 
> Could you mention what protects this field, is it the reclaim_lock?

Yes, it is.

Actually, we could avoid the lock completely for dirty_param[], using a
validation routine to check for incoherencies after any read with
get_dirty_param(), and retry if the validation fails. In practice, the
same approach we're using to read global vm_dirty_ratio, vm_dirty_bytes,
etc...

Considering that those values are rarely written and read often we can
protect them in a RCU way.


> BTW, is unsigned long sufficient to represent dirty_param(s)?

Yes, I think. It's the same type used for the equivalent global values.

> 
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> > 
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  	return swappiness;
> >  }
> > 
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +			enum mem_cgroup_dirty_param idx)
> > +{
> > +	unsigned long ret;
> > +
> > +	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +	spin_lock(&memcg->reclaim_param_lock);
> > +	ret = memcg->dirty_param[idx];
> > +	spin_unlock(&memcg->reclaim_param_lock);
> 
> Do we need a spinlock if we protect it using RCU? Is precise data very
> important?

See above.

> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned long ret = dirty_background_bytes;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return ret;
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
> > +	if (likely(memcg))
> > +		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> > +{
> > +	return do_swap_account ?
> > +			res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
> 
> Shouldn't you do a res_counter_read_u64(...) > 0 for readability?

OK.

> What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

OK, we should also check memcg->memsw_is_minimum.

> >  static struct cgroup_subsys_state * __ref
> >  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  {
> > @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  	mem->last_scanned_child = 0;
> >  	spin_lock_init(&mem->reclaim_param_lock);
> > 
> > -	if (parent)
> > +	if (parent) {
> >  		mem->swappiness = get_swappiness(parent);
> > +
> > +		spin_lock(&parent->reclaim_param_lock);
> > +		copy_dirty_params(mem, parent);
> > +		spin_unlock(&parent->reclaim_param_lock);
> > +	} else {
> > +		/*
> > +		 * XXX: should we need a lock here? we could switch from
> > +		 * vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
> > +		 * reading them atomically. The same for dirty_background_ratio
> > +		 * and dirty_background_bytes.
> > +		 *
> > +		 * For now, try to read them speculatively and retry if a
> > +		 * "conflict" is detected.a
> 
> The do while loop is subtle, can we add a validate check,share it with
> the write routine and retry if validation fails?

Agreed.

> 
> > +		 */
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
> > +						vm_dirty_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
> > +						vm_dirty_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
> > +			 mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
> > +		do {
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
> > +						dirty_background_ratio;
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
> > +						dirty_background_bytes;
> > +		} while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
> > +			mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
> > +	}
> >  	atomic_set(&mem->refcnt, 1);
> >  	mem->move_charge_at_immigrate = 0;
> >  	mutex_init(&mem->thresholds_lock);

Many thanks for reviewing,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
                       ` (5 preceding siblings ...)
  (?)
@ 2010-03-02 22:14     ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Tue, Mar 02, 2010 at 02:48:56PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > ---
> 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..d83f41c 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > -	unsigned long dirty_total;
> > +	unsigned long dirty_total, dirty_bytes;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> So you don't think 0 is a valid max dirty amount?

A value of 0 means "disabled". It's used to select between dirty_ratio
or dirty_bytes. It's the same for the gloabl vm_dirty_* parameters.

> 
> > +		dirty_total = dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (mem_cgroup_dirty_ratio() *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> > @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> >  
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> > +	if (memcg_memory < 0)
> 
> And here you somehow return negative?
> 
> > +		return memory + 1;
> > +	return min((unsigned long)memcg_memory, memory + 1);
> >  }
> >  
> >  void
> > @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
> >  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
> >  {
> >  	unsigned long background;
> > -	unsigned long dirty;
> > +	unsigned long dirty, dirty_bytes, dirty_background;
> >  	unsigned long available_memory = determine_dirtyable_memory();
> >  	struct task_struct *tsk;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> zero not valid again
> 
> > +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
> >  	else {
> >  		int dirty_ratio;
> >  
> > -		dirty_ratio = vm_dirty_ratio;
> > +		dirty_ratio = mem_cgroup_dirty_ratio();
> >  		if (dirty_ratio < 5)
> >  			dirty_ratio = 5;
> >  		dirty = (dirty_ratio * available_memory) / 100;
> >  	}
> >  
> > -	if (dirty_background_bytes)
> > -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > +	dirty_background = mem_cgroup_dirty_background_bytes();
> > +	if (dirty_background)
> 
> idem
> 
> > +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
> >  	else
> > -		background = (dirty_background_ratio * available_memory) / 100;
> > -
> > +		background = (mem_cgroup_dirty_background_ratio() *
> > +					available_memory) / 100;
> >  	if (background >= dirty)
> >  		background = dirty / 2;
> >  	tsk = current;
> > @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> >  				&bdi_thresh, bdi);
> >  
> > -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> > +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> > +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> >  					global_page_state(NR_UNSTABLE_NFS);
> 
> ??? why would a page_state be negative.. I see you return -ENOMEM on !
> cgroup, but how can one specify no dirty limit with this compiled in?
> 
> > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > +			nr_writeback = global_page_state(NR_WRITEBACK);
> > +		}
> >  
> >  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
> >  		if (bdi_cap_account_unstable(bdi)) {
> > @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  	 * In normal mode, we start background writeout at the lower
> >  	 * background_thresh, to keep the amount of dirty memory low.
> >  	 */
> > +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +	if (nr_reclaimable < 0)
> > +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +				global_page_state(NR_UNSTABLE_NFS);
> 
> Again..
> 
> >  	if ((laptop_mode && pages_written) ||
> > -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> > -			       + global_page_state(NR_UNSTABLE_NFS))
> > -					  > background_thresh)))
> > +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> >  		bdi_start_writeback(bdi, NULL, 0);
> >  }
> >  
> > @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  	unsigned long dirty_thresh;
> >  
> >          for ( ; ; ) {
> > +		unsigned long dirty;
> > +
> >  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> >  
> >                  /*
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> and again..
> 
> > +		if (dirty <= dirty_thresh)
> > +			break;
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> >  		/*
> >  		 * The caller might hold locks which can prevent IO completion
> 
> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers

I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
RCU, so something like:

	rcu_read_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	rcu_read_unlock();

That is bad when mem_cgroup_has_dirty_limit() always returns false
(e.g., when memory cgroups are disabled). So I fallback to the old
interface.

What do you think about:

	mem_cgroup_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	mem_cgroup_unlock();

Where mem_cgroup_read_lock/unlock() simply expand to nothing when
memory cgroups are disabled.

> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).

IMHO it is better to reserve 0 for the special value "disabled" like the
global settings. A synchronous IO can be also achieved using a dirty
limit of 1.

> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }

Agreed.

> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.

OK, will look at Trond's work.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:48     ` Peter Zijlstra
@ 2010-03-02 22:14       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Tue, Mar 02, 2010 at 02:48:56PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..d83f41c 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > -	unsigned long dirty_total;
> > +	unsigned long dirty_total, dirty_bytes;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> So you don't think 0 is a valid max dirty amount?

A value of 0 means "disabled". It's used to select between dirty_ratio
or dirty_bytes. It's the same for the gloabl vm_dirty_* parameters.

> 
> > +		dirty_total = dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (mem_cgroup_dirty_ratio() *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> > @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> >  
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> > +	if (memcg_memory < 0)
> 
> And here you somehow return negative?
> 
> > +		return memory + 1;
> > +	return min((unsigned long)memcg_memory, memory + 1);
> >  }
> >  
> >  void
> > @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
> >  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
> >  {
> >  	unsigned long background;
> > -	unsigned long dirty;
> > +	unsigned long dirty, dirty_bytes, dirty_background;
> >  	unsigned long available_memory = determine_dirtyable_memory();
> >  	struct task_struct *tsk;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> zero not valid again
> 
> > +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
> >  	else {
> >  		int dirty_ratio;
> >  
> > -		dirty_ratio = vm_dirty_ratio;
> > +		dirty_ratio = mem_cgroup_dirty_ratio();
> >  		if (dirty_ratio < 5)
> >  			dirty_ratio = 5;
> >  		dirty = (dirty_ratio * available_memory) / 100;
> >  	}
> >  
> > -	if (dirty_background_bytes)
> > -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > +	dirty_background = mem_cgroup_dirty_background_bytes();
> > +	if (dirty_background)
> 
> idem
> 
> > +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
> >  	else
> > -		background = (dirty_background_ratio * available_memory) / 100;
> > -
> > +		background = (mem_cgroup_dirty_background_ratio() *
> > +					available_memory) / 100;
> >  	if (background >= dirty)
> >  		background = dirty / 2;
> >  	tsk = current;
> > @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> >  				&bdi_thresh, bdi);
> >  
> > -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> > +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> > +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> >  					global_page_state(NR_UNSTABLE_NFS);
> 
> ??? why would a page_state be negative.. I see you return -ENOMEM on !
> cgroup, but how can one specify no dirty limit with this compiled in?
> 
> > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > +			nr_writeback = global_page_state(NR_WRITEBACK);
> > +		}
> >  
> >  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
> >  		if (bdi_cap_account_unstable(bdi)) {
> > @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  	 * In normal mode, we start background writeout at the lower
> >  	 * background_thresh, to keep the amount of dirty memory low.
> >  	 */
> > +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +	if (nr_reclaimable < 0)
> > +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +				global_page_state(NR_UNSTABLE_NFS);
> 
> Again..
> 
> >  	if ((laptop_mode && pages_written) ||
> > -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> > -			       + global_page_state(NR_UNSTABLE_NFS))
> > -					  > background_thresh)))
> > +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> >  		bdi_start_writeback(bdi, NULL, 0);
> >  }
> >  
> > @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  	unsigned long dirty_thresh;
> >  
> >          for ( ; ; ) {
> > +		unsigned long dirty;
> > +
> >  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> >  
> >                  /*
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> and again..
> 
> > +		if (dirty <= dirty_thresh)
> > +			break;
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> >  		/*
> >  		 * The caller might hold locks which can prevent IO completion
> 
> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers

I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
RCU, so something like:

	rcu_read_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	rcu_read_unlock();

That is bad when mem_cgroup_has_dirty_limit() always returns false
(e.g., when memory cgroups are disabled). So I fallback to the old
interface.

What do you think about:

	mem_cgroup_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	mem_cgroup_unlock();

Where mem_cgroup_read_lock/unlock() simply expand to nothing when
memory cgroups are disabled.

> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).

IMHO it is better to reserve 0 for the special value "disabled" like the
global settings. A synchronous IO can be also achieved using a dirty
limit of 1.

> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }

Agreed.

> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.

OK, will look at Trond's work.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 22:14       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Tue, Mar 02, 2010 at 02:48:56PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
> > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..d83f41c 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > -	unsigned long dirty_total;
> > +	unsigned long dirty_total, dirty_bytes;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> So you don't think 0 is a valid max dirty amount?

A value of 0 means "disabled". It's used to select between dirty_ratio
or dirty_bytes. It's the same for the gloabl vm_dirty_* parameters.

> 
> > +		dirty_total = dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (mem_cgroup_dirty_ratio() *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> > @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> >  
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> > +	if (memcg_memory < 0)
> 
> And here you somehow return negative?
> 
> > +		return memory + 1;
> > +	return min((unsigned long)memcg_memory, memory + 1);
> >  }
> >  
> >  void
> > @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
> >  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
> >  {
> >  	unsigned long background;
> > -	unsigned long dirty;
> > +	unsigned long dirty, dirty_bytes, dirty_background;
> >  	unsigned long available_memory = determine_dirtyable_memory();
> >  	struct task_struct *tsk;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> > +	dirty_bytes = mem_cgroup_dirty_bytes();
> > +	if (dirty_bytes)
> 
> zero not valid again
> 
> > +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
> >  	else {
> >  		int dirty_ratio;
> >  
> > -		dirty_ratio = vm_dirty_ratio;
> > +		dirty_ratio = mem_cgroup_dirty_ratio();
> >  		if (dirty_ratio < 5)
> >  			dirty_ratio = 5;
> >  		dirty = (dirty_ratio * available_memory) / 100;
> >  	}
> >  
> > -	if (dirty_background_bytes)
> > -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> > +	dirty_background = mem_cgroup_dirty_background_bytes();
> > +	if (dirty_background)
> 
> idem
> 
> > +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
> >  	else
> > -		background = (dirty_background_ratio * available_memory) / 100;
> > -
> > +		background = (mem_cgroup_dirty_background_ratio() *
> > +					available_memory) / 100;
> >  	if (background >= dirty)
> >  		background = dirty / 2;
> >  	tsk = current;
> > @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> >  				&bdi_thresh, bdi);
> >  
> > -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> > +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> > +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> >  					global_page_state(NR_UNSTABLE_NFS);
> 
> ??? why would a page_state be negative.. I see you return -ENOMEM on !
> cgroup, but how can one specify no dirty limit with this compiled in?
> 
> > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > +			nr_writeback = global_page_state(NR_WRITEBACK);
> > +		}
> >  
> >  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
> >  		if (bdi_cap_account_unstable(bdi)) {
> > @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  	 * In normal mode, we start background writeout at the lower
> >  	 * background_thresh, to keep the amount of dirty memory low.
> >  	 */
> > +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> > +	if (nr_reclaimable < 0)
> > +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +				global_page_state(NR_UNSTABLE_NFS);
> 
> Again..
> 
> >  	if ((laptop_mode && pages_written) ||
> > -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> > -			       + global_page_state(NR_UNSTABLE_NFS))
> > -					  > background_thresh)))
> > +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> >  		bdi_start_writeback(bdi, NULL, 0);
> >  }
> >  
> > @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  	unsigned long dirty_thresh;
> >  
> >          for ( ; ; ) {
> > +		unsigned long dirty;
> > +
> >  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> >  
> >                  /*
> > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >                   */
> >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> >  
> > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > -                        	break;
> > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +
> > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > +		if (dirty < 0)
> > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > +				global_page_state(NR_WRITEBACK);
> 
> and again..
> 
> > +		if (dirty <= dirty_thresh)
> > +			break;
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  
> >  		/*
> >  		 * The caller might hold locks which can prevent IO completion
> 
> This is ugly and broken.. I thought you'd agreed to something like:
> 
>  if (mem_cgroup_has_dirty_limit(cgroup))
>    use mem_cgroup numbers
>  else
>    use global numbers

I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
RCU, so something like:

	rcu_read_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	rcu_read_unlock();

That is bad when mem_cgroup_has_dirty_limit() always returns false
(e.g., when memory cgroups are disabled). So I fallback to the old
interface.

What do you think about:

	mem_cgroup_lock();
	if (mem_cgroup_has_dirty_limit())
		mem_cgroup_get_page_stat()
	else
		global_page_state()
	mem_cgroup_unlock();

Where mem_cgroup_read_lock/unlock() simply expand to nothing when
memory cgroups are disabled.

> 
> That allows for a 0 dirty limit (which should work and basically makes
> all io synchronous).

IMHO it is better to reserve 0 for the special value "disabled" like the
global settings. A synchronous IO can be also achieved using a dirty
limit of 1.

> 
> Also, I'd put each of those in a separate function, like:
> 
> unsigned long reclaimable_pages(cgroup)
> {
>   if (mem_cgroup_has_dirty_limit(cgroup))
>     return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>   
>   return global_page_state(NR_FILE_DIRTY) + global_page_state(NR_NFS_UNSTABLE);
> }

Agreed.

> 
> Which raises another question, you should probably rebase on top of
> Trond's patches, which removes BDI_RECLAIMABLE, suggesting you also
> loose MEMCG_NR_RECLAIM_PAGES in favour of the DIRTY+UNSTABLE split.

OK, will look at Trond's work.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]           ` <20100302135026.GH3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2010-03-02 22:18             ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton

On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2010-03-02 17:23:16]:
> 
> > On Tue, 2 Mar 2010 09:01:58 +0100
> > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > 
> > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > > 
> > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > the opportune kernel functions.
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > > 
> > > > Seems nice.
> > > > 
> > > > Hmm. the last problem is moving account between memcg.
> > > > 
> > > > Right ?
> > > 
> > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > still considering if it's correct to move dirty pages when a task is
> > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > the original cgroup and are flushed depending on the original cgroup
> > > settings. That is not totally wrong... at least moving the dirty pages
> > > between memcgs should be optional (move_charge_at_immigrate?).
> > > 
> > 
> > My concern is 
> >  - migration between memcg is already suppoted
> >     - at task move
> >     - at rmdir
> > 
> > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > the new cgroup (migration target)'s Dirty page accounting may
> > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > implementation in __mem_cgroup_move_account()
> > 
> > As
> >        if (page_mapped(page) && !PageAnon(page)) {
> >                 /* Update mapped_file data for mem_cgroup */
> >                 preempt_disable();
> >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 preempt_enable();
> >         }
> > then, FILE_MAPPED never goes negative.
> >
> 
> Absolutely! I am not sure how complex dirty memory migration will be,
> but one way of working around it would be to disable migration of
> charges when the feature is enabled (dirty* is set in the memory
> cgroup). We might need additional logic to allow that to happen. 

I've started to look at dirty memory migration. First attempt is to add
DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
__mem_cgroup_move_account(). Probably I'll have something ready for the
next version of the patch. I still need to figure if this can work as
expected...

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 13:50           ` Balbir Singh
@ 2010-03-02 22:18             ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> 
> > On Tue, 2 Mar 2010 09:01:58 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > Andrea Righi <arighi@develer.com> wrote:
> > > > 
> > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > the opportune kernel functions.
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > 
> > > > Seems nice.
> > > > 
> > > > Hmm. the last problem is moving account between memcg.
> > > > 
> > > > Right ?
> > > 
> > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > still considering if it's correct to move dirty pages when a task is
> > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > the original cgroup and are flushed depending on the original cgroup
> > > settings. That is not totally wrong... at least moving the dirty pages
> > > between memcgs should be optional (move_charge_at_immigrate?).
> > > 
> > 
> > My concern is 
> >  - migration between memcg is already suppoted
> >     - at task move
> >     - at rmdir
> > 
> > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > the new cgroup (migration target)'s Dirty page accounting may
> > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > implementation in __mem_cgroup_move_account()
> > 
> > As
> >        if (page_mapped(page) && !PageAnon(page)) {
> >                 /* Update mapped_file data for mem_cgroup */
> >                 preempt_disable();
> >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 preempt_enable();
> >         }
> > then, FILE_MAPPED never goes negative.
> >
> 
> Absolutely! I am not sure how complex dirty memory migration will be,
> but one way of working around it would be to disable migration of
> charges when the feature is enabled (dirty* is set in the memory
> cgroup). We might need additional logic to allow that to happen. 

I've started to look at dirty memory migration. First attempt is to add
DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
__mem_cgroup_move_account(). Probably I'll have something ready for the
next version of the patch. I still need to figure if this can work as
expected...

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 22:18             ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> 
> > On Tue, 2 Mar 2010 09:01:58 +0100
> > Andrea Righi <arighi@develer.com> wrote:
> > 
> > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > Andrea Righi <arighi@develer.com> wrote:
> > > > 
> > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > the opportune kernel functions.
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > 
> > > > Seems nice.
> > > > 
> > > > Hmm. the last problem is moving account between memcg.
> > > > 
> > > > Right ?
> > > 
> > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > still considering if it's correct to move dirty pages when a task is
> > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > the original cgroup and are flushed depending on the original cgroup
> > > settings. That is not totally wrong... at least moving the dirty pages
> > > between memcgs should be optional (move_charge_at_immigrate?).
> > > 
> > 
> > My concern is 
> >  - migration between memcg is already suppoted
> >     - at task move
> >     - at rmdir
> > 
> > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > the new cgroup (migration target)'s Dirty page accounting may
> > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > implementation in __mem_cgroup_move_account()
> > 
> > As
> >        if (page_mapped(page) && !PageAnon(page)) {
> >                 /* Update mapped_file data for mem_cgroup */
> >                 preempt_disable();
> >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >                 preempt_enable();
> >         }
> > then, FILE_MAPPED never goes negative.
> >
> 
> Absolutely! I am not sure how complex dirty memory migration will be,
> but one way of working around it would be to disable migration of
> charges when the feature is enabled (dirty* is set in the memory
> cgroup). We might need additional logic to allow that to happen. 

I've started to look at dirty memory migration. First attempt is to add
DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
__mem_cgroup_move_account(). Probably I'll have something ready for the
next version of the patch. I still need to figure if this can work as
expected...

-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]         ` <20100302150529.GA12855-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-02 22:22           ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >                   */
> > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > >  
> > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > -                        	break;
> > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +
> > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > +		if (dirty < 0)
> > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > +				global_page_state(NR_WRITEBACK);
> > > 
> > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > In general these patches look ok to me. I will do some testing with these.
> > 
> > Re-introduced the same bug. My bad. :(
> > 
> > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > when memory cgroup is disabled. We could simply use a long for dirty,
> > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > only for the check (see below).
> > 
> > Thanks!
> > -Andrea
> > 
> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > ---
> >  mm/page-writeback.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index d83f41c..dbee976 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  
> >  
> >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > -		if (dirty < 0)
> > +		if ((long)dirty < 0)
> 
> This will also be problematic as on 32bit systems, your uppper limit of
> dirty memory will be 2G?
> 
> I guess, I will prefer one of the two.
> 
> - return the error code from function and pass a pointer to store stats
>   in as function argument.
> 
> - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
>   per cgroup dirty control is enabled, then use per cgroup stats. In that
>   case you don't have to return negative values.
> 
>   Only tricky part will be careful accouting so that none of the stats go
>   negative in corner cases of migration etc.

What do you think about Peter's suggestion + the locking stuff? (see the
previous email). Otherwise, I'll choose the other solution, passing a
pointer and always return the error code is not bad.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 15:05         ` Vivek Goyal
@ 2010-03-02 22:22           ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >                   */
> > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > >  
> > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > -                        	break;
> > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +
> > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > +		if (dirty < 0)
> > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > +				global_page_state(NR_WRITEBACK);
> > > 
> > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > In general these patches look ok to me. I will do some testing with these.
> > 
> > Re-introduced the same bug. My bad. :(
> > 
> > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > when memory cgroup is disabled. We could simply use a long for dirty,
> > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > only for the check (see below).
> > 
> > Thanks!
> > -Andrea
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  mm/page-writeback.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index d83f41c..dbee976 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  
> >  
> >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > -		if (dirty < 0)
> > +		if ((long)dirty < 0)
> 
> This will also be problematic as on 32bit systems, your uppper limit of
> dirty memory will be 2G?
> 
> I guess, I will prefer one of the two.
> 
> - return the error code from function and pass a pointer to store stats
>   in as function argument.
> 
> - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
>   per cgroup dirty control is enabled, then use per cgroup stats. In that
>   case you don't have to return negative values.
> 
>   Only tricky part will be careful accouting so that none of the stats go
>   negative in corner cases of migration etc.

What do you think about Peter's suggestion + the locking stuff? (see the
previous email). Otherwise, I'll choose the other solution, passing a
pointer and always return the error code is not bad.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 22:22           ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >                   */
> > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > >  
> > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > -                        	break;
> > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > +
> > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > +		if (dirty < 0)
> > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > +				global_page_state(NR_WRITEBACK);
> > > 
> > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > In general these patches look ok to me. I will do some testing with these.
> > 
> > Re-introduced the same bug. My bad. :(
> > 
> > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > when memory cgroup is disabled. We could simply use a long for dirty,
> > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > only for the check (see below).
> > 
> > Thanks!
> > -Andrea
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  mm/page-writeback.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index d83f41c..dbee976 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> >  
> >  
> >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > -		if (dirty < 0)
> > +		if ((long)dirty < 0)
> 
> This will also be problematic as on 32bit systems, your uppper limit of
> dirty memory will be 2G?
> 
> I guess, I will prefer one of the two.
> 
> - return the error code from function and pass a pointer to store stats
>   in as function argument.
> 
> - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
>   per cgroup dirty control is enabled, then use per cgroup stats. In that
>   case you don't have to return negative values.
> 
>   Only tricky part will be careful accouting so that none of the stats go
>   negative in corner cases of migration etc.

What do you think about Peter's suggestion + the locking stuff? (see the
previous email). Otherwise, I'll choose the other solution, passing a
pointer and always return the error code is not bad.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
       [not found]     ` <49b004811003021008t4fae71bbu8d56192e48c32f39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-02 22:24       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:24 UTC (permalink / raw)
  To: Greg Thelen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 10:08:17AM -0800, Greg Thelen wrote:
> Comments below.  Yet to be tested on my end, but I will test it.
> 
> On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> > interfaces in the cgroupfs:
> >
> >  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> >
> >  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> >
> > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > ---
> >  include/linux/memcontrol.h |   77 ++++++++++-
> >  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 384 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1f9b119..cc88b2e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -19,12 +19,50 @@
> >
> >  #ifndef _LINUX_MEMCONTROL_H
> >  #define _LINUX_MEMCONTROL_H
> > +
> > +#include <linux/writeback.h>
> >  #include <linux/cgroup.h>
> > +
> >  struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >
> > +/* Cgroup memory statistics items exported to the kernel */
> > +enum mem_cgroup_page_stat_item {
> > +       MEMCG_NR_DIRTYABLE_PAGES,
> > +       MEMCG_NR_RECLAIM_PAGES,
> > +       MEMCG_NR_WRITEBACK,
> > +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> > +};
> > +
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +       /*
> > +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +        */
> > +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> > +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > +                                       used by soft limit implementation */
> > +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > +                                       used by threshold implementation */
> > +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> > +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> > +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> > +                                               temporary buffers */
> > +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> > +
> > +       MEM_CGROUP_STAT_NSTATS,
> > +};
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  extern int do_swap_account;
> >  #endif
> >
> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> >  static inline bool mem_cgroup_disabled(void)
> >  {
> >        if (mem_cgroup_subsys.disabled)
> > @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >                                                gfp_t gfp_mask, int nid,
> >                                                int zid);
> > @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > -                                                       int val)
> > +static inline void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> >  {
> >  }
> >
> > @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >        return 0;
> >  }
> >
> > +static inline long mem_cgroup_dirty_ratio(void)
> > +{
> > +       return vm_dirty_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       return vm_dirty_bytes;
> > +}
> > +
> > +static inline long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       return dirty_background_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       return dirty_background_bytes;
> > +}
> > +
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a443c30..e74cf66 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> >  #define SOFTLIMIT_EVENTS_THRESH (1000)
> >  #define THRESHOLDS_EVENTS_THRESH (100)
> >
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -       /*
> > -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -        */
> > -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > -                                       used by soft limit implementation */
> > -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > -                                       used by threshold implementation */
> > -
> > -       MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  struct mem_cgroup_stat_cpu {
> >        s64 count[MEM_CGROUP_STAT_NSTATS];
> >  };
> >
> > +/* Per cgroup page statistics */
> > +struct mem_cgroup_page_stat {
> > +       enum mem_cgroup_page_stat_item item;
> > +       s64 value;
> > +};
> > +
> >  /*
> >  * per-zone information in memory controller.
> >  */
> > @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
> >  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
> >  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> >
> > +enum mem_cgroup_dirty_param {
> > +       MEM_CGROUP_DIRTY_RATIO,
> > +       MEM_CGROUP_DIRTY_BYTES,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> > +
> > +       MEM_CGROUP_DIRTY_NPARAMS,
> > +};
> > +
> >  /*
> >  * The memory controller data structure. The memory controller controls both
> >  * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> >
> >        unsigned int    swappiness;
> >
> > +       /* control memory cgroup dirty pages */
> > +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> >        /* set when res.limit == memsw.limit */
> >        bool            memsw_is_minimum;
> >
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >        return swappiness;
> >  }
> >
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +                       enum mem_cgroup_dirty_param idx)
> > +{
> > +       unsigned long ret;
> > +
> > +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +       spin_lock(&memcg->reclaim_param_lock);
> > +       ret = memcg->dirty_param[idx];
> > +       spin_unlock(&memcg->reclaim_param_lock);
> > +
> > +       return ret;
> > +}
> > +
> 
> > +long mem_cgroup_dirty_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = vm_dirty_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       /*
> > +        * It's possible that "current" may be moved to other cgroup while we
> > +        * access cgroup. But precise check is meaningless because the task can
> > +        * be moved after our access and writeback tends to take long time.
> > +        * At least, "memcg" will not be freed under rcu_read_lock().
> > +        */
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = vm_dirty_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = dirty_background_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = dirty_background_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> 
> Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
> should we refactor the majority of them into a single routine?

Agreed.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
  2010-03-02 18:08     ` Greg Thelen
@ 2010-03-02 22:24       ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:24 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 10:08:17AM -0800, Greg Thelen wrote:
> Comments below.  Yet to be tested on my end, but I will test it.
> 
> On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> > interfaces in the cgroupfs:
> >
> >  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> >
> >  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> >
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  include/linux/memcontrol.h |   77 ++++++++++-
> >  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 384 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1f9b119..cc88b2e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -19,12 +19,50 @@
> >
> >  #ifndef _LINUX_MEMCONTROL_H
> >  #define _LINUX_MEMCONTROL_H
> > +
> > +#include <linux/writeback.h>
> >  #include <linux/cgroup.h>
> > +
> >  struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >
> > +/* Cgroup memory statistics items exported to the kernel */
> > +enum mem_cgroup_page_stat_item {
> > +       MEMCG_NR_DIRTYABLE_PAGES,
> > +       MEMCG_NR_RECLAIM_PAGES,
> > +       MEMCG_NR_WRITEBACK,
> > +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> > +};
> > +
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +       /*
> > +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +        */
> > +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> > +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > +                                       used by soft limit implementation */
> > +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > +                                       used by threshold implementation */
> > +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> > +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> > +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> > +                                               temporary buffers */
> > +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> > +
> > +       MEM_CGROUP_STAT_NSTATS,
> > +};
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  extern int do_swap_account;
> >  #endif
> >
> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> >  static inline bool mem_cgroup_disabled(void)
> >  {
> >        if (mem_cgroup_subsys.disabled)
> > @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >                                                gfp_t gfp_mask, int nid,
> >                                                int zid);
> > @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > -                                                       int val)
> > +static inline void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> >  {
> >  }
> >
> > @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >        return 0;
> >  }
> >
> > +static inline long mem_cgroup_dirty_ratio(void)
> > +{
> > +       return vm_dirty_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       return vm_dirty_bytes;
> > +}
> > +
> > +static inline long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       return dirty_background_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       return dirty_background_bytes;
> > +}
> > +
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a443c30..e74cf66 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> >  #define SOFTLIMIT_EVENTS_THRESH (1000)
> >  #define THRESHOLDS_EVENTS_THRESH (100)
> >
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -       /*
> > -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -        */
> > -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > -                                       used by soft limit implementation */
> > -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > -                                       used by threshold implementation */
> > -
> > -       MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  struct mem_cgroup_stat_cpu {
> >        s64 count[MEM_CGROUP_STAT_NSTATS];
> >  };
> >
> > +/* Per cgroup page statistics */
> > +struct mem_cgroup_page_stat {
> > +       enum mem_cgroup_page_stat_item item;
> > +       s64 value;
> > +};
> > +
> >  /*
> >  * per-zone information in memory controller.
> >  */
> > @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
> >  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
> >  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> >
> > +enum mem_cgroup_dirty_param {
> > +       MEM_CGROUP_DIRTY_RATIO,
> > +       MEM_CGROUP_DIRTY_BYTES,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> > +
> > +       MEM_CGROUP_DIRTY_NPARAMS,
> > +};
> > +
> >  /*
> >  * The memory controller data structure. The memory controller controls both
> >  * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> >
> >        unsigned int    swappiness;
> >
> > +       /* control memory cgroup dirty pages */
> > +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> >        /* set when res.limit == memsw.limit */
> >        bool            memsw_is_minimum;
> >
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >        return swappiness;
> >  }
> >
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +                       enum mem_cgroup_dirty_param idx)
> > +{
> > +       unsigned long ret;
> > +
> > +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +       spin_lock(&memcg->reclaim_param_lock);
> > +       ret = memcg->dirty_param[idx];
> > +       spin_unlock(&memcg->reclaim_param_lock);
> > +
> > +       return ret;
> > +}
> > +
> 
> > +long mem_cgroup_dirty_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = vm_dirty_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       /*
> > +        * It's possible that "current" may be moved to other cgroup while we
> > +        * access cgroup. But precise check is meaningless because the task can
> > +        * be moved after our access and writeback tends to take long time.
> > +        * At least, "memcg" will not be freed under rcu_read_lock().
> > +        */
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = vm_dirty_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = dirty_background_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = dirty_background_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> 
> Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
> should we refactor the majority of them into a single routine?

Agreed.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-02 22:24       ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-02 22:24 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 10:08:17AM -0800, Greg Thelen wrote:
> Comments below.  Yet to be tested on my end, but I will test it.
> 
> On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> > interfaces in the cgroupfs:
> >
> >  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> >
> >  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> >
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  include/linux/memcontrol.h |   77 ++++++++++-
> >  mm/memcontrol.c            |  336 ++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 384 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1f9b119..cc88b2e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -19,12 +19,50 @@
> >
> >  #ifndef _LINUX_MEMCONTROL_H
> >  #define _LINUX_MEMCONTROL_H
> > +
> > +#include <linux/writeback.h>
> >  #include <linux/cgroup.h>
> > +
> >  struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >
> > +/* Cgroup memory statistics items exported to the kernel */
> > +enum mem_cgroup_page_stat_item {
> > +       MEMCG_NR_DIRTYABLE_PAGES,
> > +       MEMCG_NR_RECLAIM_PAGES,
> > +       MEMCG_NR_WRITEBACK,
> > +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> > +};
> > +
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +       /*
> > +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +        */
> > +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> > +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > +                                       used by soft limit implementation */
> > +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > +                                       used by threshold implementation */
> > +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> > +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> > +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> > +                                               temporary buffers */
> > +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> > +
> > +       MEM_CGROUP_STAT_NSTATS,
> > +};
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  extern int do_swap_account;
> >  #endif
> >
> > +extern long mem_cgroup_dirty_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_bytes(void);
> > +extern long mem_cgroup_dirty_background_ratio(void);
> > +extern unsigned long mem_cgroup_dirty_background_bytes(void);
> > +
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> >  static inline bool mem_cgroup_disabled(void)
> >  {
> >        if (mem_cgroup_subsys.disabled)
> > @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >                                                gfp_t gfp_mask, int nid,
> >                                                int zid);
> > @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > -                                                       int val)
> > +static inline void mem_cgroup_update_stat(struct page *page,
> > +                       enum mem_cgroup_stat_index idx, int val)
> >  {
> >  }
> >
> > @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >        return 0;
> >  }
> >
> > +static inline long mem_cgroup_dirty_ratio(void)
> > +{
> > +       return vm_dirty_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       return vm_dirty_bytes;
> > +}
> > +
> > +static inline long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       return dirty_background_ratio;
> > +}
> > +
> > +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       return dirty_background_bytes;
> > +}
> > +
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +       return -ENOMEM;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a443c30..e74cf66 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> >  #define SOFTLIMIT_EVENTS_THRESH (1000)
> >  #define THRESHOLDS_EVENTS_THRESH (100)
> >
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -       /*
> > -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -        */
> > -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> > -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> > -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> > -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> > -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > -       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
> > -                                       used by soft limit implementation */
> > -       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> > -                                       used by threshold implementation */
> > -
> > -       MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  struct mem_cgroup_stat_cpu {
> >        s64 count[MEM_CGROUP_STAT_NSTATS];
> >  };
> >
> > +/* Per cgroup page statistics */
> > +struct mem_cgroup_page_stat {
> > +       enum mem_cgroup_page_stat_item item;
> > +       s64 value;
> > +};
> > +
> >  /*
> >  * per-zone information in memory controller.
> >  */
> > @@ -157,6 +142,15 @@ struct mem_cgroup_threshold_ary {
> >  static bool mem_cgroup_threshold_check(struct mem_cgroup *mem);
> >  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> >
> > +enum mem_cgroup_dirty_param {
> > +       MEM_CGROUP_DIRTY_RATIO,
> > +       MEM_CGROUP_DIRTY_BYTES,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> > +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> > +
> > +       MEM_CGROUP_DIRTY_NPARAMS,
> > +};
> > +
> >  /*
> >  * The memory controller data structure. The memory controller controls both
> >  * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -205,6 +199,9 @@ struct mem_cgroup {
> >
> >        unsigned int    swappiness;
> >
> > +       /* control memory cgroup dirty pages */
> > +       unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
> > +
> >        /* set when res.limit == memsw.limit */
> >        bool            memsw_is_minimum;
> >
> > @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >        return swappiness;
> >  }
> >
> > +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
> > +                       enum mem_cgroup_dirty_param idx)
> > +{
> > +       unsigned long ret;
> > +
> > +       VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
> > +       spin_lock(&memcg->reclaim_param_lock);
> > +       ret = memcg->dirty_param[idx];
> > +       spin_unlock(&memcg->reclaim_param_lock);
> > +
> > +       return ret;
> > +}
> > +
> 
> > +long mem_cgroup_dirty_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = vm_dirty_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       /*
> > +        * It's possible that "current" may be moved to other cgroup while we
> > +        * access cgroup. But precise check is meaningless because the task can
> > +        * be moved after our access and writeback tends to take long time.
> > +        * At least, "memcg" will not be freed under rcu_read_lock().
> > +        */
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = vm_dirty_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +long mem_cgroup_dirty_background_ratio(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       long ret = dirty_background_ratio;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_RATIO);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> > +unsigned long mem_cgroup_dirty_background_bytes(void)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       unsigned long ret = dirty_background_bytes;
> > +
> > +       if (mem_cgroup_disabled())
> > +               return ret;
> > +       rcu_read_lock();
> > +       memcg = mem_cgroup_from_task(current);
> > +       if (likely(memcg))
> > +               ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> 
> Given that mem_cgroup_dirty_[background_]{ratio,bytes}() are similar,
> should we refactor the majority of them into a single routine?

Agreed.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:18             ` Andrea Righi
  (?)
@ 2010-03-02 23:21             ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02 23:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2010-03-02 17:23:16]:
> > 
> > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > 
> > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > > > 
> > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > the opportune kernel functions.
> > > > > > 
> > > > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > > > 
> > > > > Seems nice.
> > > > > 
> > > > > Hmm. the last problem is moving account between memcg.
> > > > > 
> > > > > Right ?
> > > > 
> > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > still considering if it's correct to move dirty pages when a task is
> > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > the original cgroup and are flushed depending on the original cgroup
> > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > 
> > > 
> > > My concern is 
> > >  - migration between memcg is already suppoted
> > >     - at task move
> > >     - at rmdir
> > > 
> > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > the new cgroup (migration target)'s Dirty page accounting may
> > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > implementation in __mem_cgroup_move_account()
> > > 
> > > As
> > >        if (page_mapped(page) && !PageAnon(page)) {
> > >                 /* Update mapped_file data for mem_cgroup */
> > >                 preempt_disable();
> > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 preempt_enable();
> > >         }
> > > then, FILE_MAPPED never goes negative.
> > >
> > 
> > Absolutely! I am not sure how complex dirty memory migration will be,
> > but one way of working around it would be to disable migration of
> > charges when the feature is enabled (dirty* is set in the memory
> > cgroup). We might need additional logic to allow that to happen. 
> 
> I've started to look at dirty memory migration. First attempt is to add
> DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> __mem_cgroup_move_account(). Probably I'll have something ready for the
> next version of the patch. I still need to figure if this can work as
> expected...
> 
I agree it's a right direction(in fact, I have been planning to post a patch
in that direction), so I leave it to you.
Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
same way as other flags you're trying to add, and we can change
"if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:18             ` Andrea Righi
@ 2010-03-02 23:21               ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02 23:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> > 
> > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > Andrea Righi <arighi@develer.com> wrote:
> > > 
> > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > Andrea Righi <arighi@develer.com> wrote:
> > > > > 
> > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > the opportune kernel functions.
> > > > > > 
> > > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > 
> > > > > Seems nice.
> > > > > 
> > > > > Hmm. the last problem is moving account between memcg.
> > > > > 
> > > > > Right ?
> > > > 
> > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > still considering if it's correct to move dirty pages when a task is
> > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > the original cgroup and are flushed depending on the original cgroup
> > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > 
> > > 
> > > My concern is 
> > >  - migration between memcg is already suppoted
> > >     - at task move
> > >     - at rmdir
> > > 
> > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > the new cgroup (migration target)'s Dirty page accounting may
> > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > implementation in __mem_cgroup_move_account()
> > > 
> > > As
> > >        if (page_mapped(page) && !PageAnon(page)) {
> > >                 /* Update mapped_file data for mem_cgroup */
> > >                 preempt_disable();
> > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 preempt_enable();
> > >         }
> > > then, FILE_MAPPED never goes negative.
> > >
> > 
> > Absolutely! I am not sure how complex dirty memory migration will be,
> > but one way of working around it would be to disable migration of
> > charges when the feature is enabled (dirty* is set in the memory
> > cgroup). We might need additional logic to allow that to happen. 
> 
> I've started to look at dirty memory migration. First attempt is to add
> DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> __mem_cgroup_move_account(). Probably I'll have something ready for the
> next version of the patch. I still need to figure if this can work as
> expected...
> 
I agree it's a right direction(in fact, I have been planning to post a patch
in that direction), so I leave it to you.
Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
same way as other flags you're trying to add, and we can change
"if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 23:21               ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-02 23:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> > 
> > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > Andrea Righi <arighi@develer.com> wrote:
> > > 
> > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > Andrea Righi <arighi@develer.com> wrote:
> > > > > 
> > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > the opportune kernel functions.
> > > > > > 
> > > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > 
> > > > > Seems nice.
> > > > > 
> > > > > Hmm. the last problem is moving account between memcg.
> > > > > 
> > > > > Right ?
> > > > 
> > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > still considering if it's correct to move dirty pages when a task is
> > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > the original cgroup and are flushed depending on the original cgroup
> > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > 
> > > 
> > > My concern is 
> > >  - migration between memcg is already suppoted
> > >     - at task move
> > >     - at rmdir
> > > 
> > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > the new cgroup (migration target)'s Dirty page accounting may
> > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > implementation in __mem_cgroup_move_account()
> > > 
> > > As
> > >        if (page_mapped(page) && !PageAnon(page)) {
> > >                 /* Update mapped_file data for mem_cgroup */
> > >                 preempt_disable();
> > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > >                 preempt_enable();
> > >         }
> > > then, FILE_MAPPED never goes negative.
> > >
> > 
> > Absolutely! I am not sure how complex dirty memory migration will be,
> > but one way of working around it would be to disable migration of
> > charges when the feature is enabled (dirty* is set in the memory
> > cgroup). We might need additional logic to allow that to happen. 
> 
> I've started to look at dirty memory migration. First attempt is to add
> DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> __mem_cgroup_move_account(). Probably I'll have something ready for the
> next version of the patch. I still need to figure if this can work as
> expected...
> 
I agree it's a right direction(in fact, I have been planning to post a patch
in that direction), so I leave it to you.
Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
same way as other flags you're trying to add, and we can change
"if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:22           ` Andrea Righi
  (?)
@ 2010-03-02 23:59           ` Vivek Goyal
  -1 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 23:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >                   */
> > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > >  
> > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > -                        	break;
> > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > +
> > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > +		if (dirty < 0)
> > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > +				global_page_state(NR_WRITEBACK);
> > > > 
> > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > In general these patches look ok to me. I will do some testing with these.
> > > 
> > > Re-introduced the same bug. My bad. :(
> > > 
> > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > only for the check (see below).
> > > 
> > > Thanks!
> > > -Andrea
> > > 
> > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > ---
> > >  mm/page-writeback.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index d83f41c..dbee976 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >  
> > >  
> > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > -		if (dirty < 0)
> > > +		if ((long)dirty < 0)
> > 
> > This will also be problematic as on 32bit systems, your uppper limit of
> > dirty memory will be 2G?
> > 
> > I guess, I will prefer one of the two.
> > 
> > - return the error code from function and pass a pointer to store stats
> >   in as function argument.
> > 
> > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> >   case you don't have to return negative values.
> > 
> >   Only tricky part will be careful accouting so that none of the stats go
> >   negative in corner cases of migration etc.
> 
> What do you think about Peter's suggestion + the locking stuff? (see the
> previous email). Otherwise, I'll choose the other solution, passing a
> pointer and always return the error code is not bad.
> 

Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
call, task might change cgroup and later we might call
mem_cgroup_get_page_stat() on a different cgroup altogether which might or
might not have dirty limits specified?

But in what cases you don't want to use memory cgroup specified limit? I
thought cgroup disabled what the only case where we need to use global
limits. Otherwise a memory cgroup will have either dirty_bytes specified
or by default inherit global dirty_ratio which is a valid number. If
that's the case then you don't have to take rcu_lock() outside
get_page_stat()?

IOW, apart from cgroup being disabled, what are the other cases where you
expect to not use cgroup's page stat and use global stats?

Thanks
Vivek

> Thanks,
> -Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:22           ` Andrea Righi
@ 2010-03-02 23:59             ` Vivek Goyal
  -1 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 23:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >                   */
> > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > >  
> > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > -                        	break;
> > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > +
> > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > +		if (dirty < 0)
> > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > +				global_page_state(NR_WRITEBACK);
> > > > 
> > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > In general these patches look ok to me. I will do some testing with these.
> > > 
> > > Re-introduced the same bug. My bad. :(
> > > 
> > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > only for the check (see below).
> > > 
> > > Thanks!
> > > -Andrea
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > ---
> > >  mm/page-writeback.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index d83f41c..dbee976 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >  
> > >  
> > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > -		if (dirty < 0)
> > > +		if ((long)dirty < 0)
> > 
> > This will also be problematic as on 32bit systems, your uppper limit of
> > dirty memory will be 2G?
> > 
> > I guess, I will prefer one of the two.
> > 
> > - return the error code from function and pass a pointer to store stats
> >   in as function argument.
> > 
> > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> >   case you don't have to return negative values.
> > 
> >   Only tricky part will be careful accouting so that none of the stats go
> >   negative in corner cases of migration etc.
> 
> What do you think about Peter's suggestion + the locking stuff? (see the
> previous email). Otherwise, I'll choose the other solution, passing a
> pointer and always return the error code is not bad.
> 

Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
call, task might change cgroup and later we might call
mem_cgroup_get_page_stat() on a different cgroup altogether which might or
might not have dirty limits specified?

But in what cases you don't want to use memory cgroup specified limit? I
thought cgroup disabled what the only case where we need to use global
limits. Otherwise a memory cgroup will have either dirty_bytes specified
or by default inherit global dirty_ratio which is a valid number. If
that's the case then you don't have to take rcu_lock() outside
get_page_stat()?

IOW, apart from cgroup being disabled, what are the other cases where you
expect to not use cgroup's page stat and use global stats?

Thanks
Vivek

> Thanks,
> -Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-02 23:59             ` Vivek Goyal
  0 siblings, 0 replies; 140+ messages in thread
From: Vivek Goyal @ 2010-03-02 23:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >                   */
> > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > >  
> > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > -                        	break;
> > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > +
> > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > +		if (dirty < 0)
> > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > +				global_page_state(NR_WRITEBACK);
> > > > 
> > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > In general these patches look ok to me. I will do some testing with these.
> > > 
> > > Re-introduced the same bug. My bad. :(
> > > 
> > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > only for the check (see below).
> > > 
> > > Thanks!
> > > -Andrea
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > ---
> > >  mm/page-writeback.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index d83f41c..dbee976 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > >  
> > >  
> > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > -		if (dirty < 0)
> > > +		if ((long)dirty < 0)
> > 
> > This will also be problematic as on 32bit systems, your uppper limit of
> > dirty memory will be 2G?
> > 
> > I guess, I will prefer one of the two.
> > 
> > - return the error code from function and pass a pointer to store stats
> >   in as function argument.
> > 
> > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> >   case you don't have to return negative values.
> > 
> >   Only tricky part will be careful accouting so that none of the stats go
> >   negative in corner cases of migration etc.
> 
> What do you think about Peter's suggestion + the locking stuff? (see the
> previous email). Otherwise, I'll choose the other solution, passing a
> pointer and always return the error code is not bad.
> 

Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
call, task might change cgroup and later we might call
mem_cgroup_get_page_stat() on a different cgroup altogether which might or
might not have dirty limits specified?

But in what cases you don't want to use memory cgroup specified limit? I
thought cgroup disabled what the only case where we need to use global
limits. Otherwise a memory cgroup will have either dirty_bytes specified
or by default inherit global dirty_ratio which is a valid number. If
that's the case then you don't have to take rcu_lock() outside
get_page_stat()?

IOW, apart from cgroup being disabled, what are the other cases where you
expect to not use cgroup's page stat and use global stats?

Thanks
Vivek

> Thanks,
> -Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
                       ` (4 preceding siblings ...)
  2010-03-02 13:48     ` Peter Zijlstra
@ 2010-03-03  2:12     ` Daisuke Nishimura
  5 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  2:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
(snip)
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
which acquires page cgroup lock, under mapping->tree_lock.
But as I fixed before in commit e767e056, page cgroup lock must not acquired under
mapping->tree_lock.
hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.


Thanks,
Daisuke Nishimura.

On Mon,  1 Mar 2010 22:23:40 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-01 21:23   ` Andrea Righi
@ 2010-03-03  2:12     ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  2:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
(snip)
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
which acquires page cgroup lock, under mapping->tree_lock.
But as I fixed before in commit e767e056, page cgroup lock must not acquired under
mapping->tree_lock.
hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.


Thanks,
Daisuke Nishimura.

On Mon,  1 Mar 2010 22:23:40 +0100, Andrea Righi <arighi@develer.com> wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03  2:12     ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  2:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
(snip)
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
which acquires page cgroup lock, under mapping->tree_lock.
But as I fixed before in commit e767e056, page cgroup lock must not acquired under
mapping->tree_lock.
hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.


Thanks,
Daisuke Nishimura.

On Mon,  1 Mar 2010 22:23:40 +0100, Andrea Righi <arighi@develer.com> wrote:
> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   10 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   84 ++++++++++++++++++++++++++++++++------------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 76 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..aef6d13 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
>  	return 0;
> @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(clone_page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..d83f41c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> -	unsigned long dirty_total;
> +	unsigned long dirty_total, dirty_bytes;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty_total = dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (mem_cgroup_dirty_ratio() *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (memcg_memory < 0)
> +		return memory + 1;
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
>  
>  void
> @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
>  	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, dirty_bytes, dirty_background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	dirty_bytes = mem_cgroup_dirty_bytes();
> +	if (dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = mem_cgroup_dirty_ratio();
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	dirty_background = mem_cgroup_dirty_background_bytes();
> +	if (dirty_background)
> +		background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (mem_cgroup_dirty_background_ratio() *
> +					available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (nr_reclaimable < 0)
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -678,6 +689,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +
> +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		if (dirty < 0)
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1315,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1352,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
>  
> @@ -1363,8 +1385,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4d2fb93..8d74335 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -832,7 +832,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
>  
> @@ -864,7 +864,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]     ` <20100303111238.7133f8af.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
@ 2010-03-03  3:29       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  3:29 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 11:12:38 +0900
Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> wrote:

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index fe09e51..f85acae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> >  	 * having removed the page entirely.
> >  	 */
> >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  	}
> (snip)
> > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> >  {
> >  	if (mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  		task_dirty_inc(current);
> As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> which acquires page cgroup lock, under mapping->tree_lock.
> But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> mapping->tree_lock.
> hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> 
Ah, good catch! But hmmmmmm...
This account_page_dirtted() seems to be called under IRQ-disabled.
About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
then, mem_cgroup_uncharge_page() can handle it automatically.

But. there are no guarantee that following never happens. 
	lock_page_cgroup()
	    <=== interrupt.
	    -> mapping->tree_lock()
Even if mapping->tree_lock is held with IRQ-disabled.
Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().

Then, hm...some kind of new trick ? as..
(Follwoing patch is not tested!!)

==
---
 include/linux/page_cgroup.h |   14 ++++++++++++++
 mm/memcontrol.c             |   27 +++++++++++++++++----------
 2 files changed, 31 insertions(+), 10 deletions(-)

Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
@@ -39,6 +39,7 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE, /* page cgroup is under memcg account migration */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+TESTPCGFLAG(Migrate, MIGRATE)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
+{
+	local_irq_save(flags);
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+}
+static inline void
+page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
+{
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+	local_irq_restore(flags);
+}
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
-
-	lock_page_cgroup(pc);
+	/*
+	 * if locked==1, mapping->tree_lock is held. We don't have to take
+	 * care of charge/uncharge. just think about migration.
+	 */
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_lock(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
+	if (!mem || !PageCgroupUsed(pc))
 		goto done;
-
-	if (!PageCgroupUsed(pc))
-		goto done;
-
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_unlock(pc);
 }
 
 /*
@@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
-
+		
+	page_cgroup_migration_lock(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
@@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	page_cgroup_migration_lock(pc);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of




Thanks,
-Kame

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  2:12     ` Daisuke Nishimura
@ 2010-03-03  3:29       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  3:29 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, 3 Mar 2010 11:12:38 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index fe09e51..f85acae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> >  	 * having removed the page entirely.
> >  	 */
> >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  	}
> (snip)
> > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> >  {
> >  	if (mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  		task_dirty_inc(current);
> As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> which acquires page cgroup lock, under mapping->tree_lock.
> But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> mapping->tree_lock.
> hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> 
Ah, good catch! But hmmmmmm...
This account_page_dirtted() seems to be called under IRQ-disabled.
About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
then, mem_cgroup_uncharge_page() can handle it automatically.

But. there are no guarantee that following never happens. 
	lock_page_cgroup()
	    <=== interrupt.
	    -> mapping->tree_lock()
Even if mapping->tree_lock is held with IRQ-disabled.
Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().

Then, hm...some kind of new trick ? as..
(Follwoing patch is not tested!!)

==
---
 include/linux/page_cgroup.h |   14 ++++++++++++++
 mm/memcontrol.c             |   27 +++++++++++++++++----------
 2 files changed, 31 insertions(+), 10 deletions(-)

Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
@@ -39,6 +39,7 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE, /* page cgroup is under memcg account migration */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+TESTPCGFLAG(Migrate, MIGRATE)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
+{
+	local_irq_save(flags);
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+}
+static inline void
+page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
+{
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+	local_irq_restore(flags);
+}
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
-
-	lock_page_cgroup(pc);
+	/*
+	 * if locked==1, mapping->tree_lock is held. We don't have to take
+	 * care of charge/uncharge. just think about migration.
+	 */
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_lock(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
+	if (!mem || !PageCgroupUsed(pc))
 		goto done;
-
-	if (!PageCgroupUsed(pc))
-		goto done;
-
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_unlock(pc);
 }
 
 /*
@@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
-
+		
+	page_cgroup_migration_lock(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
@@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	page_cgroup_migration_lock(pc);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of




Thanks,
-Kame


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03  3:29       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  3:29 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, 3 Mar 2010 11:12:38 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index fe09e51..f85acae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> >  	 * having removed the page entirely.
> >  	 */
> >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  	}
> (snip)
> > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> >  {
> >  	if (mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  		task_dirty_inc(current);
> As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> which acquires page cgroup lock, under mapping->tree_lock.
> But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> mapping->tree_lock.
> hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> 
Ah, good catch! But hmmmmmm...
This account_page_dirtted() seems to be called under IRQ-disabled.
About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
then, mem_cgroup_uncharge_page() can handle it automatically.

But. there are no guarantee that following never happens. 
	lock_page_cgroup()
	    <=== interrupt.
	    -> mapping->tree_lock()
Even if mapping->tree_lock is held with IRQ-disabled.
Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().

Then, hm...some kind of new trick ? as..
(Follwoing patch is not tested!!)

==
---
 include/linux/page_cgroup.h |   14 ++++++++++++++
 mm/memcontrol.c             |   27 +++++++++++++++++----------
 2 files changed, 31 insertions(+), 10 deletions(-)

Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
@@ -39,6 +39,7 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE, /* page cgroup is under memcg account migration */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+TESTPCGFLAG(Migrate, MIGRATE)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
+{
+	local_irq_save(flags);
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+}
+static inline void
+page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
+{
+	bit_spin_lock(PCG_MIGRATE, &pc->flags);
+	local_irq_restore(flags);
+}
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
-
-	lock_page_cgroup(pc);
+	/*
+	 * if locked==1, mapping->tree_lock is held. We don't have to take
+	 * care of charge/uncharge. just think about migration.
+	 */
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_lock(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
+	if (!mem || !PageCgroupUsed(pc))
 		goto done;
-
-	if (!PageCgroupUsed(pc))
-		goto done;
-
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
 	 */
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	if (!locked)
+		lock_page_cgroup(pc);
+	else
+		page_cgroup_migration_unlock(pc);
 }
 
 /*
@@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
-
+		
+	page_cgroup_migration_lock(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
@@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	page_cgroup_migration_lock(pc);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of




Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]       ` <20100303122906.9c613ab2.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-03-03  6:01         ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  6:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> On Wed, 3 Mar 2010 11:12:38 +0900
> Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> wrote:
> 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index fe09e51..f85acae 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > >  	 * having removed the page entirely.
> > >  	 */
> > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  	}
> > (snip)
> > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > >  {
> > >  	if (mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  		task_dirty_inc(current);
> > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > which acquires page cgroup lock, under mapping->tree_lock.
> > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > mapping->tree_lock.
> > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > 
> Ah, good catch! But hmmmmmm...
> This account_page_dirtted() seems to be called under IRQ-disabled.
> About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> then, mem_cgroup_uncharge_page() can handle it automatically.
> 
> But. there are no guarantee that following never happens. 
> 	lock_page_cgroup()
> 	    <=== interrupt.
> 	    -> mapping->tree_lock()
> Even if mapping->tree_lock is held with IRQ-disabled.
> Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> 
> Then, hm...some kind of new trick ? as..
> (Follwoing patch is not tested!!)
> 
If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
or not aquired tree_lock, this direction will work fine.
But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.

===
 include/linux/page_cgroup.h |    8 ++++++--
 mm/memcontrol.c             |   43 +++++++++++++++++++++++++------------------
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..51da916 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,15 +83,19 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
+#define lock_page_cgroup(pc, flags) \
+  do { local_irq_save(flags); __lock_page_cgroup(pc); } while (0)
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
+#define unlock_page_cgroup(pc, flags) \
+  do { __unlock_page_cgroup(pc); local_irq_restore(flags); } while (0)
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00ed4b1..40b9be4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1327,12 +1327,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	unsigned long flags;
 
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	mem = pc->mem_cgroup;
 	if (!mem)
 		goto done;
@@ -1346,7 +1347,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 }
 
 /*
@@ -1680,11 +1681,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
+	unsigned long flags;
 
 	VM_BUG_ON(!PageLocked(page));
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
@@ -1698,7 +1700,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			mem = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return mem;
 }
 
@@ -1711,13 +1713,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
+	unsigned long flags;
+
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 		mem_cgroup_cancel_charge(mem);
 		return;
 	}
@@ -1747,7 +1751,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 
 	mem_cgroup_charge_statistics(mem, pc, true);
 
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1817,12 +1821,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
-	lock_page_cgroup(pc);
+	unsigned long flags;
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * check events
 	 */
@@ -1949,17 +1954,17 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 	 */
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
-
+		unsigned long flags;
 
 		pc = lookup_page_cgroup(page);
 		if (!pc)
 			return 0;
-		lock_page_cgroup(pc);
+		lock_page_cgroup(pc, flags);
 		if (PageCgroupUsed(pc)) {
-			unlock_page_cgroup(pc);
+			unlock_page_cgroup(pc, flags);
 			return 0;
 		}
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 	}
 
 	if (unlikely(!mm && !mem))
@@ -2141,6 +2146,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2155,7 +2161,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
 		return NULL;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 
 	mem = pc->mem_cgroup;
 
@@ -2194,7 +2200,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 */
 
 	mz = page_cgroup_zoneinfo(pc);
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	memcg_check_events(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
@@ -2204,7 +2210,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	return mem;
 
 unlock_out:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return NULL;
 }
 
@@ -2392,17 +2398,18 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	int ret = 0;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return 0;
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);


> ==
> ---
>  include/linux/page_cgroup.h |   14 ++++++++++++++
>  mm/memcontrol.c             |   27 +++++++++++++++++----------
>  2 files changed, 31 insertions(+), 10 deletions(-)
> 
> Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> @@ -39,6 +39,7 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE, /* page cgroup is under memcg account migration */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +TESTPCGFLAG(Migrate, MIGRATE)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
> +{
> +	local_irq_save(flags);
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +}
> +static inline void
> +page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
> +{
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +	local_irq_restore(flags);
> +}
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> @@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> -
> -	lock_page_cgroup(pc);
> +	/*
> +	 * if locked==1, mapping->tree_lock is held. We don't have to take
> +	 * care of charge/uncharge. just think about migration.
> +	 */
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_lock(pc);
>  	mem = pc->mem_cgroup;
> -	if (!mem)
> +	if (!mem || !PageCgroupUsed(pc))
>  		goto done;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto done;
> -
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
>  	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
>  
>  done:
> -	unlock_page_cgroup(pc);
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_unlock(pc);
>  }
>  
>  /*
> @@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupLocked(pc));
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
> -
> +		
> +	page_cgroup_migration_lock(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> @@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	page_cgroup_migration_lock(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 
> 
> 
> 
> Thanks,
> -Kame
> 

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  3:29       ` KAMEZAWA Hiroyuki
@ 2010-03-03  6:01         ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  6:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 3 Mar 2010 11:12:38 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index fe09e51..f85acae 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > >  	 * having removed the page entirely.
> > >  	 */
> > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  	}
> > (snip)
> > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > >  {
> > >  	if (mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  		task_dirty_inc(current);
> > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > which acquires page cgroup lock, under mapping->tree_lock.
> > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > mapping->tree_lock.
> > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > 
> Ah, good catch! But hmmmmmm...
> This account_page_dirtted() seems to be called under IRQ-disabled.
> About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> then, mem_cgroup_uncharge_page() can handle it automatically.
> 
> But. there are no guarantee that following never happens. 
> 	lock_page_cgroup()
> 	    <=== interrupt.
> 	    -> mapping->tree_lock()
> Even if mapping->tree_lock is held with IRQ-disabled.
> Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> 
> Then, hm...some kind of new trick ? as..
> (Follwoing patch is not tested!!)
> 
If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
or not aquired tree_lock, this direction will work fine.
But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.

===
 include/linux/page_cgroup.h |    8 ++++++--
 mm/memcontrol.c             |   43 +++++++++++++++++++++++++------------------
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..51da916 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,15 +83,19 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
+#define lock_page_cgroup(pc, flags) \
+  do { local_irq_save(flags); __lock_page_cgroup(pc); } while (0)
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
+#define unlock_page_cgroup(pc, flags) \
+  do { __unlock_page_cgroup(pc); local_irq_restore(flags); } while (0)
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00ed4b1..40b9be4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1327,12 +1327,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	unsigned long flags;
 
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	mem = pc->mem_cgroup;
 	if (!mem)
 		goto done;
@@ -1346,7 +1347,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 }
 
 /*
@@ -1680,11 +1681,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
+	unsigned long flags;
 
 	VM_BUG_ON(!PageLocked(page));
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
@@ -1698,7 +1700,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			mem = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return mem;
 }
 
@@ -1711,13 +1713,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
+	unsigned long flags;
+
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 		mem_cgroup_cancel_charge(mem);
 		return;
 	}
@@ -1747,7 +1751,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 
 	mem_cgroup_charge_statistics(mem, pc, true);
 
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1817,12 +1821,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
-	lock_page_cgroup(pc);
+	unsigned long flags;
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * check events
 	 */
@@ -1949,17 +1954,17 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 	 */
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
-
+		unsigned long flags;
 
 		pc = lookup_page_cgroup(page);
 		if (!pc)
 			return 0;
-		lock_page_cgroup(pc);
+		lock_page_cgroup(pc, flags);
 		if (PageCgroupUsed(pc)) {
-			unlock_page_cgroup(pc);
+			unlock_page_cgroup(pc, flags);
 			return 0;
 		}
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 	}
 
 	if (unlikely(!mm && !mem))
@@ -2141,6 +2146,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2155,7 +2161,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
 		return NULL;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 
 	mem = pc->mem_cgroup;
 
@@ -2194,7 +2200,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 */
 
 	mz = page_cgroup_zoneinfo(pc);
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	memcg_check_events(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
@@ -2204,7 +2210,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	return mem;
 
 unlock_out:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return NULL;
 }
 
@@ -2392,17 +2398,18 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	int ret = 0;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return 0;
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);


> ==
> ---
>  include/linux/page_cgroup.h |   14 ++++++++++++++
>  mm/memcontrol.c             |   27 +++++++++++++++++----------
>  2 files changed, 31 insertions(+), 10 deletions(-)
> 
> Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> @@ -39,6 +39,7 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE, /* page cgroup is under memcg account migration */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +TESTPCGFLAG(Migrate, MIGRATE)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
> +{
> +	local_irq_save(flags);
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +}
> +static inline void
> +page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
> +{
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +	local_irq_restore(flags);
> +}
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> @@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> -
> -	lock_page_cgroup(pc);
> +	/*
> +	 * if locked==1, mapping->tree_lock is held. We don't have to take
> +	 * care of charge/uncharge. just think about migration.
> +	 */
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_lock(pc);
>  	mem = pc->mem_cgroup;
> -	if (!mem)
> +	if (!mem || !PageCgroupUsed(pc))
>  		goto done;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto done;
> -
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
>  	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
>  
>  done:
> -	unlock_page_cgroup(pc);
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_unlock(pc);
>  }
>  
>  /*
> @@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupLocked(pc));
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
> -
> +		
> +	page_cgroup_migration_lock(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> @@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	page_cgroup_migration_lock(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 
> 
> 
> 
> Thanks,
> -Kame
> 

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03  6:01         ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03  6:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 3 Mar 2010 11:12:38 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index fe09e51..f85acae 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > >  	 * having removed the page entirely.
> > >  	 */
> > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  	}
> > (snip)
> > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > >  {
> > >  	if (mapping_cap_account_dirty(mapping)) {
> > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > >  		task_dirty_inc(current);
> > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > which acquires page cgroup lock, under mapping->tree_lock.
> > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > mapping->tree_lock.
> > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > 
> Ah, good catch! But hmmmmmm...
> This account_page_dirtted() seems to be called under IRQ-disabled.
> About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> then, mem_cgroup_uncharge_page() can handle it automatically.
> 
> But. there are no guarantee that following never happens. 
> 	lock_page_cgroup()
> 	    <=== interrupt.
> 	    -> mapping->tree_lock()
> Even if mapping->tree_lock is held with IRQ-disabled.
> Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> 
> Then, hm...some kind of new trick ? as..
> (Follwoing patch is not tested!!)
> 
If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
or not aquired tree_lock, this direction will work fine.
But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.

===
 include/linux/page_cgroup.h |    8 ++++++--
 mm/memcontrol.c             |   43 +++++++++++++++++++++++++------------------
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..51da916 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,15 +83,19 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
+#define lock_page_cgroup(pc, flags) \
+  do { local_irq_save(flags); __lock_page_cgroup(pc); } while (0)
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
+#define unlock_page_cgroup(pc, flags) \
+  do { __unlock_page_cgroup(pc); local_irq_restore(flags); } while (0)
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00ed4b1..40b9be4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1327,12 +1327,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	unsigned long flags;
 
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	mem = pc->mem_cgroup;
 	if (!mem)
 		goto done;
@@ -1346,7 +1347,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 }
 
 /*
@@ -1680,11 +1681,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
+	unsigned long flags;
 
 	VM_BUG_ON(!PageLocked(page));
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
@@ -1698,7 +1700,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			mem = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return mem;
 }
 
@@ -1711,13 +1713,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
+	unsigned long flags;
+
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 		mem_cgroup_cancel_charge(mem);
 		return;
 	}
@@ -1747,7 +1751,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 
 	mem_cgroup_charge_statistics(mem, pc, true);
 
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1817,12 +1821,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
-	lock_page_cgroup(pc);
+	unsigned long flags;
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * check events
 	 */
@@ -1949,17 +1954,17 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 	 */
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
-
+		unsigned long flags;
 
 		pc = lookup_page_cgroup(page);
 		if (!pc)
 			return 0;
-		lock_page_cgroup(pc);
+		lock_page_cgroup(pc, flags);
 		if (PageCgroupUsed(pc)) {
-			unlock_page_cgroup(pc);
+			unlock_page_cgroup(pc, flags);
 			return 0;
 		}
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 	}
 
 	if (unlikely(!mm && !mem))
@@ -2141,6 +2146,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2155,7 +2161,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
 		return NULL;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 
 	mem = pc->mem_cgroup;
 
@@ -2194,7 +2200,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 */
 
 	mz = page_cgroup_zoneinfo(pc);
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	memcg_check_events(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
@@ -2204,7 +2210,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	return mem;
 
 unlock_out:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return NULL;
 }
 
@@ -2392,17 +2398,18 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	int ret = 0;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return 0;
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	if (mem) {
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);


> ==
> ---
>  include/linux/page_cgroup.h |   14 ++++++++++++++
>  mm/memcontrol.c             |   27 +++++++++++++++++----------
>  2 files changed, 31 insertions(+), 10 deletions(-)
> 
> Index: mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Feb11/include/linux/page_cgroup.h
> @@ -39,6 +39,7 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE, /* page cgroup is under memcg account migration */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +74,8 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +TESTPCGFLAG(Migrate, MIGRATE)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -93,6 +96,17 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +static inline unsigned long page_cgroup_migration_lock(struct page_cgroup *pc)
> +{
> +	local_irq_save(flags);
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +}
> +static inline void
> +page_cgroup_migration_lock(struct page_cgroup *pc, unsigned long flags)
> +{
> +	bit_spin_lock(PCG_MIGRATE, &pc->flags);
> +	local_irq_restore(flags);
> +}
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1321,7 +1321,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
>   */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_file_mapped(struct page *page, int val, int locked)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
> @@ -1329,22 +1329,27 @@ void mem_cgroup_update_file_mapped(struc
>  	pc = lookup_page_cgroup(page);
>  	if (unlikely(!pc))
>  		return;
> -
> -	lock_page_cgroup(pc);
> +	/*
> +	 * if locked==1, mapping->tree_lock is held. We don't have to take
> +	 * care of charge/uncharge. just think about migration.
> +	 */
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_lock(pc);
>  	mem = pc->mem_cgroup;
> -	if (!mem)
> +	if (!mem || !PageCgroupUsed(pc))
>  		goto done;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto done;
> -
>  	/*
>  	 * Preemption is already disabled. We can use __this_cpu_xxx
>  	 */
>  	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
>  
>  done:
> -	unlock_page_cgroup(pc);
> +	if (!locked)
> +		lock_page_cgroup(pc);
> +	else
> +		page_cgroup_migration_unlock(pc);
>  }
>  
>  /*
> @@ -1785,7 +1790,8 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupLocked(pc));
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
> -
> +		
> +	page_cgroup_migration_lock(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> @@ -1802,6 +1808,7 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	page_cgroup_migration_lock(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 
> 
> 
> 
> Thanks,
> -Kame
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]         ` <20100303150137.f56d7084.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
@ 2010-03-03  6:15           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  6:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 15:01:37 +0900
Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> wrote:

> On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> > On Wed, 3 Mar 2010 11:12:38 +0900
> > Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> wrote:
> > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index fe09e51..f85acae 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > > >  	 * having removed the page entirely.
> > > >  	 */
> > > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  	}
> > > (snip)
> > > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > > >  {
> > > >  	if (mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  		task_dirty_inc(current);
> > > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > > which acquires page cgroup lock, under mapping->tree_lock.
> > > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > > mapping->tree_lock.
> > > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > > 
> > Ah, good catch! But hmmmmmm...
> > This account_page_dirtted() seems to be called under IRQ-disabled.
> > About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> > then, mem_cgroup_uncharge_page() can handle it automatically.
> > 
> > But. there are no guarantee that following never happens. 
> > 	lock_page_cgroup()
> > 	    <=== interrupt.
> > 	    -> mapping->tree_lock()
> > Even if mapping->tree_lock is held with IRQ-disabled.
> > Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> > 
> > Then, hm...some kind of new trick ? as..
> > (Follwoing patch is not tested!!)
> > 
> If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
> or not aquired tree_lock, this direction will work fine.
> But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.
> 

Agreed.
Let's try how we can write a code in clean way. (we have time ;)
For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
over killing. What I really want is lockless code...but it seems impossible
under current implementation.

I wonder the fact "the page is never unchareged under us" can give us some chances
...Hmm.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  6:01         ` Daisuke Nishimura
@ 2010-03-03  6:15           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  6:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, 3 Mar 2010 15:01:37 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 3 Mar 2010 11:12:38 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index fe09e51..f85acae 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > > >  	 * having removed the page entirely.
> > > >  	 */
> > > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  	}
> > > (snip)
> > > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > > >  {
> > > >  	if (mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  		task_dirty_inc(current);
> > > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > > which acquires page cgroup lock, under mapping->tree_lock.
> > > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > > mapping->tree_lock.
> > > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > > 
> > Ah, good catch! But hmmmmmm...
> > This account_page_dirtted() seems to be called under IRQ-disabled.
> > About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> > then, mem_cgroup_uncharge_page() can handle it automatically.
> > 
> > But. there are no guarantee that following never happens. 
> > 	lock_page_cgroup()
> > 	    <=== interrupt.
> > 	    -> mapping->tree_lock()
> > Even if mapping->tree_lock is held with IRQ-disabled.
> > Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> > 
> > Then, hm...some kind of new trick ? as..
> > (Follwoing patch is not tested!!)
> > 
> If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
> or not aquired tree_lock, this direction will work fine.
> But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.
> 

Agreed.
Let's try how we can write a code in clean way. (we have time ;)
For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
over killing. What I really want is lockless code...but it seems impossible
under current implementation.

I wonder the fact "the page is never unchareged under us" can give us some chances
...Hmm.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03  6:15           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  6:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, 3 Mar 2010 15:01:37 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 3 Mar 2010 11:12:38 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index fe09e51..f85acae 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> > > >  	 * having removed the page entirely.
> > > >  	 */
> > > >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> > > >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  	}
> > > (snip)
> > > > @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> > > >  void account_page_dirtied(struct page *page, struct address_space *mapping)
> > > >  {
> > > >  	if (mapping_cap_account_dirty(mapping)) {
> > > > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
> > > >  		__inc_zone_page_state(page, NR_FILE_DIRTY);
> > > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> > > >  		task_dirty_inc(current);
> > > As long as I can see, those two functions(at least) calls mem_cgroup_update_state(),
> > > which acquires page cgroup lock, under mapping->tree_lock.
> > > But as I fixed before in commit e767e056, page cgroup lock must not acquired under
> > > mapping->tree_lock.
> > > hmm, we should call those mem_cgroup_update_state() outside mapping->tree_lock,
> > > or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock.
> > > 
> > Ah, good catch! But hmmmmmm...
> > This account_page_dirtted() seems to be called under IRQ-disabled.
> > About  __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag,
> > then, mem_cgroup_uncharge_page() can handle it automatically.
> > 
> > But. there are no guarantee that following never happens. 
> > 	lock_page_cgroup()
> > 	    <=== interrupt.
> > 	    -> mapping->tree_lock()
> > Even if mapping->tree_lock is held with IRQ-disabled.
> > Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
> > 
> > Then, hm...some kind of new trick ? as..
> > (Follwoing patch is not tested!!)
> > 
> If we can verify that all callers of mem_cgroup_update_stat() have always either aquired
> or not aquired tree_lock, this direction will work fine.
> But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below.
> 

Agreed.
Let's try how we can write a code in clean way. (we have time ;)
For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
over killing. What I really want is lockless code...but it seems impossible
under current implementation.

I wonder the fact "the page is never unchareged under us" can give us some chances
...Hmm.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]           ` <20100303151549.5d3d686a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-03-03  8:21             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  8:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, 3 Mar 2010 15:15:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:

> Agreed.
> Let's try how we can write a code in clean way. (we have time ;)
> For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> over killing. What I really want is lockless code...but it seems impossible
> under current implementation.
> 
> I wonder the fact "the page is never unchareged under us" can give us some chances
> ...Hmm.
> 

How about this ? Basically, I don't like duplicating information...so,
# of new pcg_flags may be able to be reduced.

I'm glad this can be a hint for Andrea-san.

==
---
 include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
 mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 132 insertions(+), 3 deletions(-)

Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
@@ -39,6 +39,11 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+	PCG_ACCT_DIRTY,
+	PCG_ACCT_WB,
+	PCG_ACCT_WB_TEMP,
+	PCG_ACCT_UNSTABLE,
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+SETPCGFLAG(AcctDirty, ACCT_DIRTY);
+CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
+TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
+
+SETPCGFLAG(AcctWB, ACCT_WB);
+CLEARPCGFLAG(AcctWB, ACCT_WB);
+TESTPCGFLAG(AcctWB, ACCT_WB);
+
+SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+
+SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
 {
 	return page_zonenum(pc->page);
 }
-
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+/*
+ * Lock order is
+ * 	lock_page_cgroup()
+ * 		lock_page_cgroup_migrate()
+ * This lock is not be lock for charge/uncharge but for account moving.
+ * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
+ * the page is uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Mar2/mm/memcontrol.c
@@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_STAT_DIRTY,
+	MEM_CGROUP_STAT_WBACK,
+	MEM_CGROUP_STAT_WBACK_TEMP,
+	MEM_CGROUP_STAT_UNSTABLE_NFS,
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1360,6 +1364,86 @@ done:
 }
 
 /*
+ * Update file cache's status for memcg. Before calling this,
+ * mapping->tree_lock should be held and preemption is disabled.
+ * Then, it's guarnteed that the page is not uncharged while we
+ * access page_cgroup. We can make use of that.
+ */
+void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	pc = lookup_page_cgroup(page);
+	/* Not accounted ? */
+	if (!PageCgroupUsed(pc))
+		return;
+	lock_page_cgroup_migrate(pc);
+	/*
+	 * It's guarnteed that this page is never uncharged.
+	 * The only racy problem is moving account among memcgs.
+	 */
+	switch (idx) {
+	case MEM_CGROUP_STAT_DIRTY:
+		if (set)
+			SetPageCgroupAcctDirty(pc);
+		else
+			ClearPageCgroupAcctDirty(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK:
+		if (set)
+			SetPageCgroupAcctWB(pc);
+		else
+			ClearPageCgroupAcctWB(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK_TEMP:
+		if (set)
+			SetPageCgroupAcctWBTemp(pc);
+		else
+			ClearPageCgroupAcctWBTemp(pc);
+		break;
+	case MEM_CGROUP_STAT_UNSTABLE_NFS:
+		if (set)
+			SetPageCgroupAcctUnstableNFS(pc);
+		else
+			ClearPageCgroupAcctUnstableNFS(pc);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	mem = pc->mem_cgroup;
+	if (set)
+		__this_cpu_inc(mem->stat->count[idx]);
+	else
+		__this_cpu_dec(mem->stat->count[idx]);
+	unlock_page_cgroup_migrate(pc);
+}
+
+static void move_acct_information(struct mem_cgroup *from,
+				struct mem_cgroup *to,
+				struct page_cgroup *pc)
+{
+	/* preemption is disabled, migration_lock is held. */
+	if (PageCgroupAcctDirty(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
+	}
+	if (PageCgroupAcctWB(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
+	}
+	if (PageCgroupAcctWBTemp(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+	}
+	if (PageCgroupAcctUnstableNFS(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+	}
+}
+
+/*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
  */
@@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
+	preempt_disable();
+	lock_page_cgroup_migrate(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, pc, false);
+	move_acct_information(from, to, pc);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
 		mem_cgroup_cancel_charge(from);
@@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	unlock_page_cgroup_migrate(pc);
+	preempt_enable();
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  6:15           ` KAMEZAWA Hiroyuki
@ 2010-03-03  8:21             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  8:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrea Righi, containers, linux-kernel,
	linux-mm, Greg, Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 15:15:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Agreed.
> Let's try how we can write a code in clean way. (we have time ;)
> For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> over killing. What I really want is lockless code...but it seems impossible
> under current implementation.
> 
> I wonder the fact "the page is never unchareged under us" can give us some chances
> ...Hmm.
> 

How about this ? Basically, I don't like duplicating information...so,
# of new pcg_flags may be able to be reduced.

I'm glad this can be a hint for Andrea-san.

==
---
 include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
 mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 132 insertions(+), 3 deletions(-)

Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
@@ -39,6 +39,11 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+	PCG_ACCT_DIRTY,
+	PCG_ACCT_WB,
+	PCG_ACCT_WB_TEMP,
+	PCG_ACCT_UNSTABLE,
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+SETPCGFLAG(AcctDirty, ACCT_DIRTY);
+CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
+TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
+
+SETPCGFLAG(AcctWB, ACCT_WB);
+CLEARPCGFLAG(AcctWB, ACCT_WB);
+TESTPCGFLAG(AcctWB, ACCT_WB);
+
+SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+
+SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
 {
 	return page_zonenum(pc->page);
 }
-
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+/*
+ * Lock order is
+ * 	lock_page_cgroup()
+ * 		lock_page_cgroup_migrate()
+ * This lock is not be lock for charge/uncharge but for account moving.
+ * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
+ * the page is uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Mar2/mm/memcontrol.c
@@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_STAT_DIRTY,
+	MEM_CGROUP_STAT_WBACK,
+	MEM_CGROUP_STAT_WBACK_TEMP,
+	MEM_CGROUP_STAT_UNSTABLE_NFS,
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1360,6 +1364,86 @@ done:
 }
 
 /*
+ * Update file cache's status for memcg. Before calling this,
+ * mapping->tree_lock should be held and preemption is disabled.
+ * Then, it's guarnteed that the page is not uncharged while we
+ * access page_cgroup. We can make use of that.
+ */
+void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	pc = lookup_page_cgroup(page);
+	/* Not accounted ? */
+	if (!PageCgroupUsed(pc))
+		return;
+	lock_page_cgroup_migrate(pc);
+	/*
+	 * It's guarnteed that this page is never uncharged.
+	 * The only racy problem is moving account among memcgs.
+	 */
+	switch (idx) {
+	case MEM_CGROUP_STAT_DIRTY:
+		if (set)
+			SetPageCgroupAcctDirty(pc);
+		else
+			ClearPageCgroupAcctDirty(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK:
+		if (set)
+			SetPageCgroupAcctWB(pc);
+		else
+			ClearPageCgroupAcctWB(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK_TEMP:
+		if (set)
+			SetPageCgroupAcctWBTemp(pc);
+		else
+			ClearPageCgroupAcctWBTemp(pc);
+		break;
+	case MEM_CGROUP_STAT_UNSTABLE_NFS:
+		if (set)
+			SetPageCgroupAcctUnstableNFS(pc);
+		else
+			ClearPageCgroupAcctUnstableNFS(pc);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	mem = pc->mem_cgroup;
+	if (set)
+		__this_cpu_inc(mem->stat->count[idx]);
+	else
+		__this_cpu_dec(mem->stat->count[idx]);
+	unlock_page_cgroup_migrate(pc);
+}
+
+static void move_acct_information(struct mem_cgroup *from,
+				struct mem_cgroup *to,
+				struct page_cgroup *pc)
+{
+	/* preemption is disabled, migration_lock is held. */
+	if (PageCgroupAcctDirty(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
+	}
+	if (PageCgroupAcctWB(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
+	}
+	if (PageCgroupAcctWBTemp(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+	}
+	if (PageCgroupAcctUnstableNFS(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+	}
+}
+
+/*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
  */
@@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
+	preempt_disable();
+	lock_page_cgroup_migrate(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, pc, false);
+	move_acct_information(from, to, pc);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
 		mem_cgroup_cancel_charge(from);
@@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	unlock_page_cgroup_migrate(pc);
+	preempt_enable();
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of


^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03  8:21             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  8:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrea Righi, containers, linux-kernel,
	linux-mm, Greg, Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 15:15:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Agreed.
> Let's try how we can write a code in clean way. (we have time ;)
> For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> over killing. What I really want is lockless code...but it seems impossible
> under current implementation.
> 
> I wonder the fact "the page is never unchareged under us" can give us some chances
> ...Hmm.
> 

How about this ? Basically, I don't like duplicating information...so,
# of new pcg_flags may be able to be reduced.

I'm glad this can be a hint for Andrea-san.

==
---
 include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
 mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 132 insertions(+), 3 deletions(-)

Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
@@ -39,6 +39,11 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+	PCG_ACCT_DIRTY,
+	PCG_ACCT_WB,
+	PCG_ACCT_WB_TEMP,
+	PCG_ACCT_UNSTABLE,
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+SETPCGFLAG(AcctDirty, ACCT_DIRTY);
+CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
+TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
+
+SETPCGFLAG(AcctWB, ACCT_WB);
+CLEARPCGFLAG(AcctWB, ACCT_WB);
+TESTPCGFLAG(AcctWB, ACCT_WB);
+
+SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
+
+SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
+
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
 {
 	return page_zonenum(pc->page);
 }
-
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+/*
+ * Lock order is
+ * 	lock_page_cgroup()
+ * 		lock_page_cgroup_migrate()
+ * This lock is not be lock for charge/uncharge but for account moving.
+ * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
+ * the page is uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Mar2/mm/memcontrol.c
@@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_STAT_DIRTY,
+	MEM_CGROUP_STAT_WBACK,
+	MEM_CGROUP_STAT_WBACK_TEMP,
+	MEM_CGROUP_STAT_UNSTABLE_NFS,
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1360,6 +1364,86 @@ done:
 }
 
 /*
+ * Update file cache's status for memcg. Before calling this,
+ * mapping->tree_lock should be held and preemption is disabled.
+ * Then, it's guarnteed that the page is not uncharged while we
+ * access page_cgroup. We can make use of that.
+ */
+void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	pc = lookup_page_cgroup(page);
+	/* Not accounted ? */
+	if (!PageCgroupUsed(pc))
+		return;
+	lock_page_cgroup_migrate(pc);
+	/*
+	 * It's guarnteed that this page is never uncharged.
+	 * The only racy problem is moving account among memcgs.
+	 */
+	switch (idx) {
+	case MEM_CGROUP_STAT_DIRTY:
+		if (set)
+			SetPageCgroupAcctDirty(pc);
+		else
+			ClearPageCgroupAcctDirty(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK:
+		if (set)
+			SetPageCgroupAcctWB(pc);
+		else
+			ClearPageCgroupAcctWB(pc);
+		break;
+	case MEM_CGROUP_STAT_WBACK_TEMP:
+		if (set)
+			SetPageCgroupAcctWBTemp(pc);
+		else
+			ClearPageCgroupAcctWBTemp(pc);
+		break;
+	case MEM_CGROUP_STAT_UNSTABLE_NFS:
+		if (set)
+			SetPageCgroupAcctUnstableNFS(pc);
+		else
+			ClearPageCgroupAcctUnstableNFS(pc);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	mem = pc->mem_cgroup;
+	if (set)
+		__this_cpu_inc(mem->stat->count[idx]);
+	else
+		__this_cpu_dec(mem->stat->count[idx]);
+	unlock_page_cgroup_migrate(pc);
+}
+
+static void move_acct_information(struct mem_cgroup *from,
+				struct mem_cgroup *to,
+				struct page_cgroup *pc)
+{
+	/* preemption is disabled, migration_lock is held. */
+	if (PageCgroupAcctDirty(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
+	}
+	if (PageCgroupAcctWB(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
+	}
+	if (PageCgroupAcctWBTemp(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
+	}
+	if (PageCgroupAcctUnstableNFS(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+	}
+}
+
+/*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
  */
@@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
+	preempt_disable();
+	lock_page_cgroup_migrate(pc);
 	page = pc->page;
 	if (page_mapped(page) && !PageAnon(page)) {
 		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, pc, false);
+	move_acct_information(from, to, pc);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
 		mem_cgroup_cancel_charge(from);
@@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	unlock_page_cgroup_migrate(pc);
+	preempt_enable();
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:14       ` Andrea Righi
  (?)
  (?)
@ 2010-03-03 10:07       ` Peter Zijlstra
  -1 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-03 10:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> 
> I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> RCU, so something like:
> 
>         rcu_read_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         rcu_read_unlock();
> 
> That is bad when mem_cgroup_has_dirty_limit() always returns false
> (e.g., when memory cgroups are disabled). So I fallback to the old
> interface.

Why is it that mem_cgroup_has_dirty_limit() needs RCU when
mem_cgroup_get_page_stat() doesn't? That is, simply make
mem_cgroup_has_dirty_limit() not require RCU in the same way
*_get_page_stat() doesn't either.

> What do you think about:
> 
>         mem_cgroup_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         mem_cgroup_unlock();
> 
> Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> memory cgroups are disabled.

I think you're engineering the wrong way around.

> > 
> > That allows for a 0 dirty limit (which should work and basically makes
> > all io synchronous).
> 
> IMHO it is better to reserve 0 for the special value "disabled" like the
> global settings. A synchronous IO can be also achieved using a dirty
> limit of 1.

Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
byte/page writeback cache effectively reduces to the same thing, but its
not the same thing conceptually. If you want to put the size and enable
into a single variable pick -1 for disable or so.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 22:14       ` Andrea Righi
@ 2010-03-03 10:07         ` Peter Zijlstra
  -1 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-03 10:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> 
> I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> RCU, so something like:
> 
>         rcu_read_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         rcu_read_unlock();
> 
> That is bad when mem_cgroup_has_dirty_limit() always returns false
> (e.g., when memory cgroups are disabled). So I fallback to the old
> interface.

Why is it that mem_cgroup_has_dirty_limit() needs RCU when
mem_cgroup_get_page_stat() doesn't? That is, simply make
mem_cgroup_has_dirty_limit() not require RCU in the same way
*_get_page_stat() doesn't either.

> What do you think about:
> 
>         mem_cgroup_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         mem_cgroup_unlock();
> 
> Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> memory cgroups are disabled.

I think you're engineering the wrong way around.

> > 
> > That allows for a 0 dirty limit (which should work and basically makes
> > all io synchronous).
> 
> IMHO it is better to reserve 0 for the special value "disabled" like the
> global settings. A synchronous IO can be also achieved using a dirty
> limit of 1.

Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
byte/page writeback cache effectively reduces to the same thing, but its
not the same thing conceptually. If you want to put the size and enable
into a single variable pick -1 for disable or so.




^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 10:07         ` Peter Zijlstra
  0 siblings, 0 replies; 140+ messages in thread
From: Peter Zijlstra @ 2010-03-03 10:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> 
> I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> RCU, so something like:
> 
>         rcu_read_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         rcu_read_unlock();
> 
> That is bad when mem_cgroup_has_dirty_limit() always returns false
> (e.g., when memory cgroups are disabled). So I fallback to the old
> interface.

Why is it that mem_cgroup_has_dirty_limit() needs RCU when
mem_cgroup_get_page_stat() doesn't? That is, simply make
mem_cgroup_has_dirty_limit() not require RCU in the same way
*_get_page_stat() doesn't either.

> What do you think about:
> 
>         mem_cgroup_lock();
>         if (mem_cgroup_has_dirty_limit())
>                 mem_cgroup_get_page_stat()
>         else
>                 global_page_state()
>         mem_cgroup_unlock();
> 
> Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> memory cgroups are disabled.

I think you're engineering the wrong way around.

> > 
> > That allows for a 0 dirty limit (which should work and basically makes
> > all io synchronous).
> 
> IMHO it is better to reserve 0 for the special value "disabled" like the
> global settings. A synchronous IO can be also achieved using a dirty
> limit of 1.

Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
byte/page writeback cache effectively reduces to the same thing, but its
not the same thing conceptually. If you want to put the size and enable
into a single variable pick -1 for disable or so.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]             ` <20100302235932.GA3007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2010-03-03 11:47               ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > >                   */
> > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > >  
> > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > -                        	break;
> > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > +
> > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > +		if (dirty < 0)
> > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > 
> > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > In general these patches look ok to me. I will do some testing with these.
> > > > 
> > > > Re-introduced the same bug. My bad. :(
> > > > 
> > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > only for the check (see below).
> > > > 
> > > > Thanks!
> > > > -Andrea
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > > ---
> > > >  mm/page-writeback.c |    2 +-
> > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index d83f41c..dbee976 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >  
> > > >  
> > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > -		if (dirty < 0)
> > > > +		if ((long)dirty < 0)
> > > 
> > > This will also be problematic as on 32bit systems, your uppper limit of
> > > dirty memory will be 2G?
> > > 
> > > I guess, I will prefer one of the two.
> > > 
> > > - return the error code from function and pass a pointer to store stats
> > >   in as function argument.
> > > 
> > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > >   case you don't have to return negative values.
> > > 
> > >   Only tricky part will be careful accouting so that none of the stats go
> > >   negative in corner cases of migration etc.
> > 
> > What do you think about Peter's suggestion + the locking stuff? (see the
> > previous email). Otherwise, I'll choose the other solution, passing a
> > pointer and always return the error code is not bad.
> > 
> 
> Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> call, task might change cgroup and later we might call
> mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> might not have dirty limits specified?

Correct.

> 
> But in what cases you don't want to use memory cgroup specified limit? I
> thought cgroup disabled what the only case where we need to use global
> limits. Otherwise a memory cgroup will have either dirty_bytes specified
> or by default inherit global dirty_ratio which is a valid number. If
> that's the case then you don't have to take rcu_lock() outside
> get_page_stat()?
> 
> IOW, apart from cgroup being disabled, what are the other cases where you
> expect to not use cgroup's page stat and use global stats?

At boot, when mem_cgroup_from_task() may return NULL. But this is not
related to the RCU acquisition.

Anyway, probably the RCU protection is not so critical for this
particular case, and we can simply get rid of it. In this way we can
easily implement the interface proposed by Peter.

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 23:59             ` Vivek Goyal
@ 2010-03-03 11:47               ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > >                   */
> > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > >  
> > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > -                        	break;
> > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > +
> > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > +		if (dirty < 0)
> > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > 
> > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > In general these patches look ok to me. I will do some testing with these.
> > > > 
> > > > Re-introduced the same bug. My bad. :(
> > > > 
> > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > only for the check (see below).
> > > > 
> > > > Thanks!
> > > > -Andrea
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > ---
> > > >  mm/page-writeback.c |    2 +-
> > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index d83f41c..dbee976 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >  
> > > >  
> > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > -		if (dirty < 0)
> > > > +		if ((long)dirty < 0)
> > > 
> > > This will also be problematic as on 32bit systems, your uppper limit of
> > > dirty memory will be 2G?
> > > 
> > > I guess, I will prefer one of the two.
> > > 
> > > - return the error code from function and pass a pointer to store stats
> > >   in as function argument.
> > > 
> > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > >   case you don't have to return negative values.
> > > 
> > >   Only tricky part will be careful accouting so that none of the stats go
> > >   negative in corner cases of migration etc.
> > 
> > What do you think about Peter's suggestion + the locking stuff? (see the
> > previous email). Otherwise, I'll choose the other solution, passing a
> > pointer and always return the error code is not bad.
> > 
> 
> Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> call, task might change cgroup and later we might call
> mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> might not have dirty limits specified?

Correct.

> 
> But in what cases you don't want to use memory cgroup specified limit? I
> thought cgroup disabled what the only case where we need to use global
> limits. Otherwise a memory cgroup will have either dirty_bytes specified
> or by default inherit global dirty_ratio which is a valid number. If
> that's the case then you don't have to take rcu_lock() outside
> get_page_stat()?
> 
> IOW, apart from cgroup being disabled, what are the other cases where you
> expect to not use cgroup's page stat and use global stats?

At boot, when mem_cgroup_from_task() may return NULL. But this is not
related to the RCU acquisition.

Anyway, probably the RCU protection is not so critical for this
particular case, and we can simply get rid of it. In this way we can
easily implement the interface proposed by Peter.

-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 11:47               ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > >                   */
> > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > >  
> > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > -                        	break;
> > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > +
> > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > +		if (dirty < 0)
> > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > 
> > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > In general these patches look ok to me. I will do some testing with these.
> > > > 
> > > > Re-introduced the same bug. My bad. :(
> > > > 
> > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > only for the check (see below).
> > > > 
> > > > Thanks!
> > > > -Andrea
> > > > 
> > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > ---
> > > >  mm/page-writeback.c |    2 +-
> > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index d83f41c..dbee976 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > >  
> > > >  
> > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > -		if (dirty < 0)
> > > > +		if ((long)dirty < 0)
> > > 
> > > This will also be problematic as on 32bit systems, your uppper limit of
> > > dirty memory will be 2G?
> > > 
> > > I guess, I will prefer one of the two.
> > > 
> > > - return the error code from function and pass a pointer to store stats
> > >   in as function argument.
> > > 
> > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > >   case you don't have to return negative values.
> > > 
> > >   Only tricky part will be careful accouting so that none of the stats go
> > >   negative in corner cases of migration etc.
> > 
> > What do you think about Peter's suggestion + the locking stuff? (see the
> > previous email). Otherwise, I'll choose the other solution, passing a
> > pointer and always return the error code is not bad.
> > 
> 
> Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> call, task might change cgroup and later we might call
> mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> might not have dirty limits specified?

Correct.

> 
> But in what cases you don't want to use memory cgroup specified limit? I
> thought cgroup disabled what the only case where we need to use global
> limits. Otherwise a memory cgroup will have either dirty_bytes specified
> or by default inherit global dirty_ratio which is a valid number. If
> that's the case then you don't have to take rcu_lock() outside
> get_page_stat()?
> 
> IOW, apart from cgroup being disabled, what are the other cases where you
> expect to not use cgroup's page stat and use global stats?

At boot, when mem_cgroup_from_task() may return NULL. But this is not
related to the RCU acquisition.

Anyway, probably the RCU protection is not so critical for this
particular case, and we can simply get rid of it. In this way we can
easily implement the interface proposed by Peter.

-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]               ` <20100303082107.a29562fa.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
@ 2010-03-03 11:48                 ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:48 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, Mar 03, 2010 at 08:21:07AM +0900, Daisuke Nishimura wrote:
> On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2010-03-02 17:23:16]:
> > > 
> > > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > > 
> > > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > > Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> > > > > > 
> > > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > > the opportune kernel functions.
> > > > > > > 
> > > > > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > > > > 
> > > > > > Seems nice.
> > > > > > 
> > > > > > Hmm. the last problem is moving account between memcg.
> > > > > > 
> > > > > > Right ?
> > > > > 
> > > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > > still considering if it's correct to move dirty pages when a task is
> > > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > > the original cgroup and are flushed depending on the original cgroup
> > > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > > 
> > > > 
> > > > My concern is 
> > > >  - migration between memcg is already suppoted
> > > >     - at task move
> > > >     - at rmdir
> > > > 
> > > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > > the new cgroup (migration target)'s Dirty page accounting may
> > > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > > implementation in __mem_cgroup_move_account()
> > > > 
> > > > As
> > > >        if (page_mapped(page) && !PageAnon(page)) {
> > > >                 /* Update mapped_file data for mem_cgroup */
> > > >                 preempt_disable();
> > > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 preempt_enable();
> > > >         }
> > > > then, FILE_MAPPED never goes negative.
> > > >
> > > 
> > > Absolutely! I am not sure how complex dirty memory migration will be,
> > > but one way of working around it would be to disable migration of
> > > charges when the feature is enabled (dirty* is set in the memory
> > > cgroup). We might need additional logic to allow that to happen. 
> > 
> > I've started to look at dirty memory migration. First attempt is to add
> > DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> > __mem_cgroup_move_account(). Probably I'll have something ready for the
> > next version of the patch. I still need to figure if this can work as
> > expected...
> > 
> I agree it's a right direction(in fact, I have been planning to post a patch
> in that direction), so I leave it to you.
> Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
> same way as other flags you're trying to add, and we can change
> "if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
> in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.

OK, sounds good to me. I'll introduce PCG_FILE_MAPPED in the next
version.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-02 23:21               ` Daisuke Nishimura
@ 2010-03-03 11:48                 ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:48 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, Mar 03, 2010 at 08:21:07AM +0900, Daisuke Nishimura wrote:
> On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi@develer.com> wrote:
> > On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> > > 
> > > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > > Andrea Righi <arighi@develer.com> wrote:
> > > > 
> > > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > > Andrea Righi <arighi@develer.com> wrote:
> > > > > > 
> > > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > > the opportune kernel functions.
> > > > > > > 
> > > > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > > 
> > > > > > Seems nice.
> > > > > > 
> > > > > > Hmm. the last problem is moving account between memcg.
> > > > > > 
> > > > > > Right ?
> > > > > 
> > > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > > still considering if it's correct to move dirty pages when a task is
> > > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > > the original cgroup and are flushed depending on the original cgroup
> > > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > > 
> > > > 
> > > > My concern is 
> > > >  - migration between memcg is already suppoted
> > > >     - at task move
> > > >     - at rmdir
> > > > 
> > > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > > the new cgroup (migration target)'s Dirty page accounting may
> > > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > > implementation in __mem_cgroup_move_account()
> > > > 
> > > > As
> > > >        if (page_mapped(page) && !PageAnon(page)) {
> > > >                 /* Update mapped_file data for mem_cgroup */
> > > >                 preempt_disable();
> > > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 preempt_enable();
> > > >         }
> > > > then, FILE_MAPPED never goes negative.
> > > >
> > > 
> > > Absolutely! I am not sure how complex dirty memory migration will be,
> > > but one way of working around it would be to disable migration of
> > > charges when the feature is enabled (dirty* is set in the memory
> > > cgroup). We might need additional logic to allow that to happen. 
> > 
> > I've started to look at dirty memory migration. First attempt is to add
> > DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> > __mem_cgroup_move_account(). Probably I'll have something ready for the
> > next version of the patch. I still need to figure if this can work as
> > expected...
> > 
> I agree it's a right direction(in fact, I have been planning to post a patch
> in that direction), so I leave it to you.
> Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
> same way as other flags you're trying to add, and we can change
> "if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
> in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.

OK, sounds good to me. I'll introduce PCG_FILE_MAPPED in the next
version.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 11:48                 ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:48 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Wed, Mar 03, 2010 at 08:21:07AM +0900, Daisuke Nishimura wrote:
> On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi <arighi@develer.com> wrote:
> > On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-02 17:23:16]:
> > > 
> > > > On Tue, 2 Mar 2010 09:01:58 +0100
> > > > Andrea Righi <arighi@develer.com> wrote:
> > > > 
> > > > > On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > On Mon,  1 Mar 2010 22:23:40 +0100
> > > > > > Andrea Righi <arighi@develer.com> wrote:
> > > > > > 
> > > > > > > Apply the cgroup dirty pages accounting and limiting infrastructure to
> > > > > > > the opportune kernel functions.
> > > > > > > 
> > > > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > > 
> > > > > > Seems nice.
> > > > > > 
> > > > > > Hmm. the last problem is moving account between memcg.
> > > > > > 
> > > > > > Right ?
> > > > > 
> > > > > Correct. This was actually the last item of the TODO list. Anyway, I'm
> > > > > still considering if it's correct to move dirty pages when a task is
> > > > > migrated from a cgroup to another. Currently, dirty pages just remain in
> > > > > the original cgroup and are flushed depending on the original cgroup
> > > > > settings. That is not totally wrong... at least moving the dirty pages
> > > > > between memcgs should be optional (move_charge_at_immigrate?).
> > > > > 
> > > > 
> > > > My concern is 
> > > >  - migration between memcg is already suppoted
> > > >     - at task move
> > > >     - at rmdir
> > > > 
> > > > Then, if you leave DIRTY_PAGE accounting to original cgroup,
> > > > the new cgroup (migration target)'s Dirty page accounting may
> > > > goes to be negative, or incorrect value. Please check FILE_MAPPED
> > > > implementation in __mem_cgroup_move_account()
> > > > 
> > > > As
> > > >        if (page_mapped(page) && !PageAnon(page)) {
> > > >                 /* Update mapped_file data for mem_cgroup */
> > > >                 preempt_disable();
> > > >                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > > >                 preempt_enable();
> > > >         }
> > > > then, FILE_MAPPED never goes negative.
> > > >
> > > 
> > > Absolutely! I am not sure how complex dirty memory migration will be,
> > > but one way of working around it would be to disable migration of
> > > charges when the feature is enabled (dirty* is set in the memory
> > > cgroup). We might need additional logic to allow that to happen. 
> > 
> > I've started to look at dirty memory migration. First attempt is to add
> > DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
> > __mem_cgroup_move_account(). Probably I'll have something ready for the
> > next version of the patch. I still need to figure if this can work as
> > expected...
> > 
> I agree it's a right direction(in fact, I have been planning to post a patch
> in that direction), so I leave it to you.
> Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
> same way as other flags you're trying to add, and we can change
> "if (page_mapped(page) && !PageAnon(page))" to "if (PageCgroupFileMapped(pc)"
> in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.

OK, sounds good to me. I'll introduce PCG_FILE_MAPPED in the next
version.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]             ` <20100303172132.fc6d9387.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-03-03 11:50               ` Andrea Righi
  2010-03-03 22:03               ` Andrea Righi
  1 sibling, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.

Many thanks! I already wrote pretty the same code, but at this point I
think I'll just apply and test this one. ;)

-Andrea

> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  8:21             ` KAMEZAWA Hiroyuki
@ 2010-03-03 11:50               ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.

Many thanks! I already wrote pretty the same code, but at this point I
think I'll just apply and test this one. ;)

-Andrea

> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 11:50               ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.

Many thanks! I already wrote pretty the same code, but at this point I
think I'll just apply and test this one. ;)

-Andrea

> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 11:47               ` Andrea Righi
  (?)
  (?)
@ 2010-03-03 11:56               ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, Mar 03, 2010 at 12:47:03PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> > On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > > >                   */
> > > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > > >  
> > > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > > -                        	break;
> > > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > > +
> > > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > > +		if (dirty < 0)
> > > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > > 
> > > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > > In general these patches look ok to me. I will do some testing with these.
> > > > > 
> > > > > Re-introduced the same bug. My bad. :(
> > > > > 
> > > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > > only for the check (see below).
> > > > > 
> > > > > Thanks!
> > > > > -Andrea
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> > > > > ---
> > > > >  mm/page-writeback.c |    2 +-
> > > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index d83f41c..dbee976 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >  
> > > > >  
> > > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > -		if (dirty < 0)
> > > > > +		if ((long)dirty < 0)
> > > > 
> > > > This will also be problematic as on 32bit systems, your uppper limit of
> > > > dirty memory will be 2G?
> > > > 
> > > > I guess, I will prefer one of the two.
> > > > 
> > > > - return the error code from function and pass a pointer to store stats
> > > >   in as function argument.
> > > > 
> > > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > > >   case you don't have to return negative values.
> > > > 
> > > >   Only tricky part will be careful accouting so that none of the stats go
> > > >   negative in corner cases of migration etc.
> > > 
> > > What do you think about Peter's suggestion + the locking stuff? (see the
> > > previous email). Otherwise, I'll choose the other solution, passing a
> > > pointer and always return the error code is not bad.
> > > 
> > 
> > Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> > call, task might change cgroup and later we might call
> > mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> > might not have dirty limits specified?
> 
> Correct.
> 
> > 
> > But in what cases you don't want to use memory cgroup specified limit? I
> > thought cgroup disabled what the only case where we need to use global
> > limits. Otherwise a memory cgroup will have either dirty_bytes specified
> > or by default inherit global dirty_ratio which is a valid number. If
> > that's the case then you don't have to take rcu_lock() outside
> > get_page_stat()?
> > 
> > IOW, apart from cgroup being disabled, what are the other cases where you
> > expect to not use cgroup's page stat and use global stats?
> 
> At boot, when mem_cgroup_from_task() may return NULL. But this is not
> related to the RCU acquisition.

Nevermind. You're right. In any case even if a task is migrated to a
different cgroup it will always have mem_cgroup_has_dirty_limit() ==
true.

So RCU protection is not needed outside these functions.

OK, I'll go with the Peter's suggestion.

Thanks!
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 11:47               ` Andrea Righi
@ 2010-03-03 11:56                 ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Wed, Mar 03, 2010 at 12:47:03PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> > On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > > >                   */
> > > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > > >  
> > > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > > -                        	break;
> > > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > > +
> > > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > > +		if (dirty < 0)
> > > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > > 
> > > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > > In general these patches look ok to me. I will do some testing with these.
> > > > > 
> > > > > Re-introduced the same bug. My bad. :(
> > > > > 
> > > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > > only for the check (see below).
> > > > > 
> > > > > Thanks!
> > > > > -Andrea
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > ---
> > > > >  mm/page-writeback.c |    2 +-
> > > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index d83f41c..dbee976 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >  
> > > > >  
> > > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > -		if (dirty < 0)
> > > > > +		if ((long)dirty < 0)
> > > > 
> > > > This will also be problematic as on 32bit systems, your uppper limit of
> > > > dirty memory will be 2G?
> > > > 
> > > > I guess, I will prefer one of the two.
> > > > 
> > > > - return the error code from function and pass a pointer to store stats
> > > >   in as function argument.
> > > > 
> > > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > > >   case you don't have to return negative values.
> > > > 
> > > >   Only tricky part will be careful accouting so that none of the stats go
> > > >   negative in corner cases of migration etc.
> > > 
> > > What do you think about Peter's suggestion + the locking stuff? (see the
> > > previous email). Otherwise, I'll choose the other solution, passing a
> > > pointer and always return the error code is not bad.
> > > 
> > 
> > Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> > call, task might change cgroup and later we might call
> > mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> > might not have dirty limits specified?
> 
> Correct.
> 
> > 
> > But in what cases you don't want to use memory cgroup specified limit? I
> > thought cgroup disabled what the only case where we need to use global
> > limits. Otherwise a memory cgroup will have either dirty_bytes specified
> > or by default inherit global dirty_ratio which is a valid number. If
> > that's the case then you don't have to take rcu_lock() outside
> > get_page_stat()?
> > 
> > IOW, apart from cgroup being disabled, what are the other cases where you
> > expect to not use cgroup's page stat and use global stats?
> 
> At boot, when mem_cgroup_from_task() may return NULL. But this is not
> related to the RCU acquisition.

Nevermind. You're right. In any case even if a task is migrated to a
different cgroup it will always have mem_cgroup_has_dirty_limit() ==
true.

So RCU protection is not needed outside these functions.

OK, I'll go with the Peter's suggestion.

Thanks!
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 11:56                 ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 11:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm

On Wed, Mar 03, 2010 at 12:47:03PM +0100, Andrea Righi wrote:
> On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
> > On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
> > > On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
> > > > On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
> > > > > On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
> > > > > > > @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > > > >                   */
> > > > > > >                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> > > > > > >  
> > > > > > > -                if (global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > > > > -                        	break;
> > > > > > > -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > > > > > +
> > > > > > > +		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > > > +		if (dirty < 0)
> > > > > > > +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> > > > > > > +				global_page_state(NR_WRITEBACK);
> > > > > > 
> > > > > > dirty is unsigned long. As mentioned last time, above will never be true?
> > > > > > In general these patches look ok to me. I will do some testing with these.
> > > > > 
> > > > > Re-introduced the same bug. My bad. :(
> > > > > 
> > > > > The value returned from mem_cgroup_page_stat() can be negative, i.e.
> > > > > when memory cgroup is disabled. We could simply use a long for dirty,
> > > > > the unit is in # of pages so s64 should be enough. Or cast dirty to long
> > > > > only for the check (see below).
> > > > > 
> > > > > Thanks!
> > > > > -Andrea
> > > > > 
> > > > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > > > ---
> > > > >  mm/page-writeback.c |    2 +-
> > > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index d83f41c..dbee976 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > > > >  
> > > > >  
> > > > >  		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> > > > > -		if (dirty < 0)
> > > > > +		if ((long)dirty < 0)
> > > > 
> > > > This will also be problematic as on 32bit systems, your uppper limit of
> > > > dirty memory will be 2G?
> > > > 
> > > > I guess, I will prefer one of the two.
> > > > 
> > > > - return the error code from function and pass a pointer to store stats
> > > >   in as function argument.
> > > > 
> > > > - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
> > > >   per cgroup dirty control is enabled, then use per cgroup stats. In that
> > > >   case you don't have to return negative values.
> > > > 
> > > >   Only tricky part will be careful accouting so that none of the stats go
> > > >   negative in corner cases of migration etc.
> > > 
> > > What do you think about Peter's suggestion + the locking stuff? (see the
> > > previous email). Otherwise, I'll choose the other solution, passing a
> > > pointer and always return the error code is not bad.
> > > 
> > 
> > Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit()
> > call, task might change cgroup and later we might call
> > mem_cgroup_get_page_stat() on a different cgroup altogether which might or
> > might not have dirty limits specified?
> 
> Correct.
> 
> > 
> > But in what cases you don't want to use memory cgroup specified limit? I
> > thought cgroup disabled what the only case where we need to use global
> > limits. Otherwise a memory cgroup will have either dirty_bytes specified
> > or by default inherit global dirty_ratio which is a valid number. If
> > that's the case then you don't have to take rcu_lock() outside
> > get_page_stat()?
> > 
> > IOW, apart from cgroup being disabled, what are the other cases where you
> > expect to not use cgroup's page stat and use global stats?
> 
> At boot, when mem_cgroup_from_task() may return NULL. But this is not
> related to the RCU acquisition.

Nevermind. You're right. In any case even if a task is migrated to a
different cgroup it will always have mem_cgroup_has_dirty_limit() ==
true.

So RCU protection is not needed outside these functions.

OK, I'll go with the Peter's suggestion.

Thanks!
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 10:07         ` Peter Zijlstra
  (?)
@ 2010-03-03 12:05         ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 12:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, Mar 03, 2010 at 11:07:35AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> > 
> > I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> > RCU, so something like:
> > 
> >         rcu_read_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         rcu_read_unlock();
> > 
> > That is bad when mem_cgroup_has_dirty_limit() always returns false
> > (e.g., when memory cgroups are disabled). So I fallback to the old
> > interface.
> 
> Why is it that mem_cgroup_has_dirty_limit() needs RCU when
> mem_cgroup_get_page_stat() doesn't? That is, simply make
> mem_cgroup_has_dirty_limit() not require RCU in the same way
> *_get_page_stat() doesn't either.

OK, I agree we can get rid of RCU protection here (see my previous
email).

BTW the point was that after mem_cgroup_has_dirty_limit() the task might
be moved to another cgroup, but also in this case mem_cgroup_has_dirty_limit()
will be always true, so mem_cgroup_get_page_stat() is always coherent.

> 
> > What do you think about:
> > 
> >         mem_cgroup_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         mem_cgroup_unlock();
> > 
> > Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> > memory cgroups are disabled.
> 
> I think you're engineering the wrong way around.
> 
> > > 
> > > That allows for a 0 dirty limit (which should work and basically makes
> > > all io synchronous).
> > 
> > IMHO it is better to reserve 0 for the special value "disabled" like the
> > global settings. A synchronous IO can be also achieved using a dirty
> > limit of 1.
> 
> Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
> byte/page writeback cache effectively reduces to the same thing, but its
> not the same thing conceptually. If you want to put the size and enable
> into a single variable pick -1 for disable or so.

I might agree, and actually I prefer this solution.. but in this way we
would use a different interface respect to the equivalent vm_dirty_ratio
/ vm_dirty_bytes global settings (as well as dirty_background_ratio /
dirty_background_bytes).

IMHO it's better to use the same interface to avoid user
misunderstandings.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 10:07         ` Peter Zijlstra
@ 2010-03-03 12:05           ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 12:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Wed, Mar 03, 2010 at 11:07:35AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> > 
> > I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> > RCU, so something like:
> > 
> >         rcu_read_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         rcu_read_unlock();
> > 
> > That is bad when mem_cgroup_has_dirty_limit() always returns false
> > (e.g., when memory cgroups are disabled). So I fallback to the old
> > interface.
> 
> Why is it that mem_cgroup_has_dirty_limit() needs RCU when
> mem_cgroup_get_page_stat() doesn't? That is, simply make
> mem_cgroup_has_dirty_limit() not require RCU in the same way
> *_get_page_stat() doesn't either.

OK, I agree we can get rid of RCU protection here (see my previous
email).

BTW the point was that after mem_cgroup_has_dirty_limit() the task might
be moved to another cgroup, but also in this case mem_cgroup_has_dirty_limit()
will be always true, so mem_cgroup_get_page_stat() is always coherent.

> 
> > What do you think about:
> > 
> >         mem_cgroup_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         mem_cgroup_unlock();
> > 
> > Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> > memory cgroups are disabled.
> 
> I think you're engineering the wrong way around.
> 
> > > 
> > > That allows for a 0 dirty limit (which should work and basically makes
> > > all io synchronous).
> > 
> > IMHO it is better to reserve 0 for the special value "disabled" like the
> > global settings. A synchronous IO can be also achieved using a dirty
> > limit of 1.
> 
> Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
> byte/page writeback cache effectively reduces to the same thing, but its
> not the same thing conceptually. If you want to put the size and enable
> into a single variable pick -1 for disable or so.

I might agree, and actually I prefer this solution.. but in this way we
would use a different interface respect to the equivalent vm_dirty_ratio
/ vm_dirty_bytes global settings (as well as dirty_background_ratio /
dirty_background_bytes).

IMHO it's better to use the same interface to avoid user
misunderstandings.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 12:05           ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 12:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Trond Myklebust

On Wed, Mar 03, 2010 at 11:07:35AM +0100, Peter Zijlstra wrote:
> On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
> > 
> > I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
> > RCU, so something like:
> > 
> >         rcu_read_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         rcu_read_unlock();
> > 
> > That is bad when mem_cgroup_has_dirty_limit() always returns false
> > (e.g., when memory cgroups are disabled). So I fallback to the old
> > interface.
> 
> Why is it that mem_cgroup_has_dirty_limit() needs RCU when
> mem_cgroup_get_page_stat() doesn't? That is, simply make
> mem_cgroup_has_dirty_limit() not require RCU in the same way
> *_get_page_stat() doesn't either.

OK, I agree we can get rid of RCU protection here (see my previous
email).

BTW the point was that after mem_cgroup_has_dirty_limit() the task might
be moved to another cgroup, but also in this case mem_cgroup_has_dirty_limit()
will be always true, so mem_cgroup_get_page_stat() is always coherent.

> 
> > What do you think about:
> > 
> >         mem_cgroup_lock();
> >         if (mem_cgroup_has_dirty_limit())
> >                 mem_cgroup_get_page_stat()
> >         else
> >                 global_page_state()
> >         mem_cgroup_unlock();
> > 
> > Where mem_cgroup_read_lock/unlock() simply expand to nothing when
> > memory cgroups are disabled.
> 
> I think you're engineering the wrong way around.
> 
> > > 
> > > That allows for a 0 dirty limit (which should work and basically makes
> > > all io synchronous).
> > 
> > IMHO it is better to reserve 0 for the special value "disabled" like the
> > global settings. A synchronous IO can be also achieved using a dirty
> > limit of 1.
> 
> Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
> byte/page writeback cache effectively reduces to the same thing, but its
> not the same thing conceptually. If you want to put the size and enable
> into a single variable pick -1 for disable or so.

I might agree, and actually I prefer this solution.. but in this way we
would use a different interface respect to the equivalent vm_dirty_ratio
/ vm_dirty_bytes global settings (as well as dirty_background_ratio /
dirty_background_bytes).

IMHO it's better to use the same interface to avoid user
misunderstandings.

Thanks,
-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
       [not found]             ` <20100303172132.fc6d9387.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2010-03-03 11:50               ` Andrea Righi
@ 2010-03-03 22:03               ` Andrea Righi
  1 sibling, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 22:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.
> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);

Kame-san, a question. According to is_target_pte_for_mc() it seems we
don't move file pages across cgroups for now. If !PageAnon(page) we just
return 0 and the page won't be selected for migration in
mem_cgroup_move_charge_pte_range().

So, if I've understood well the code is correct in perspective, but
right now it's unnecessary. File pages are not moved on task migration
across cgroups and, at the moment, there's no way for file page
accounted statistics to go negative.

Or am I missing something?

Thanks,
-Andrea

>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03  8:21             ` KAMEZAWA Hiroyuki
@ 2010-03-03 22:03               ` Andrea Righi
  -1 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 22:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.
> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);

Kame-san, a question. According to is_target_pte_for_mc() it seems we
don't move file pages across cgroups for now. If !PageAnon(page) we just
return 0 and the page won't be selected for migration in
mem_cgroup_move_charge_pte_range().

So, if I've understood well the code is correct in perspective, but
right now it's unnecessary. File pages are not moved on task migration
across cgroups and, at the moment, there's no way for file page
accounted statistics to go negative.

Or am I missing something?

Thanks,
-Andrea

>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 22:03               ` Andrea Righi
  0 siblings, 0 replies; 140+ messages in thread
From: Andrea Righi @ 2010-03-03 22:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 3 Mar 2010 15:15:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Agreed.
> > Let's try how we can write a code in clean way. (we have time ;)
> > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > over killing. What I really want is lockless code...but it seems impossible
> > under current implementation.
> > 
> > I wonder the fact "the page is never unchareged under us" can give us some chances
> > ...Hmm.
> > 
> 
> How about this ? Basically, I don't like duplicating information...so,
> # of new pcg_flags may be able to be reduced.
> 
> I'm glad this can be a hint for Andrea-san.
> 
> ==
> ---
>  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
>  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 132 insertions(+), 3 deletions(-)
> 
> Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> @@ -39,6 +39,11 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> +	PCG_ACCT_DIRTY,
> +	PCG_ACCT_WB,
> +	PCG_ACCT_WB_TEMP,
> +	PCG_ACCT_UNSTABLE,
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  
> +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> +
> +SETPCGFLAG(AcctWB, ACCT_WB);
> +CLEARPCGFLAG(AcctWB, ACCT_WB);
> +TESTPCGFLAG(AcctWB, ACCT_WB);
> +
> +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> +
> +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> +
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
>  {
>  	return page_zonenum(pc->page);
>  }
> -
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +/*
> + * Lock order is
> + * 	lock_page_cgroup()
> + * 		lock_page_cgroup_migrate()
> + * This lock is not be lock for charge/uncharge but for account moving.
> + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> + * the page is uncharged while we hold this.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_DIRTY,
> +	MEM_CGROUP_STAT_WBACK,
> +	MEM_CGROUP_STAT_WBACK_TEMP,
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1360,6 +1364,86 @@ done:
>  }
>  
>  /*
> + * Update file cache's status for memcg. Before calling this,
> + * mapping->tree_lock should be held and preemption is disabled.
> + * Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
> + */
> +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *mem;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* Not accounted ? */
> +	if (!PageCgroupUsed(pc))
> +		return;
> +	lock_page_cgroup_migrate(pc);
> +	/*
> +	 * It's guarnteed that this page is never uncharged.
> +	 * The only racy problem is moving account among memcgs.
> +	 */
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_DIRTY:
> +		if (set)
> +			SetPageCgroupAcctDirty(pc);
> +		else
> +			ClearPageCgroupAcctDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK:
> +		if (set)
> +			SetPageCgroupAcctWB(pc);
> +		else
> +			ClearPageCgroupAcctWB(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WBACK_TEMP:
> +		if (set)
> +			SetPageCgroupAcctWBTemp(pc);
> +		else
> +			ClearPageCgroupAcctWBTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (set)
> +			SetPageCgroupAcctUnstableNFS(pc);
> +		else
> +			ClearPageCgroupAcctUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (set)
> +		__this_cpu_inc(mem->stat->count[idx]);
> +	else
> +		__this_cpu_dec(mem->stat->count[idx]);
> +	unlock_page_cgroup_migrate(pc);
> +}
> +
> +static void move_acct_information(struct mem_cgroup *from,
> +				struct mem_cgroup *to,
> +				struct page_cgroup *pc)
> +{
> +	/* preemption is disabled, migration_lock is held. */
> +	if (PageCgroupAcctDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> +	}
> +	if (PageCgroupAcctWB(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> +	}
> +	if (PageCgroupAcctWBTemp(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> +	}
> +	if (PageCgroupAcctUnstableNFS(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
> +/*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>   * TODO: maybe necessary to use big numbers in big irons.
>   */
> @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
>  	page = pc->page;
>  	if (page_mapped(page) && !PageAnon(page)) {
>  		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
>  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
>  	}
>  	mem_cgroup_charge_statistics(from, pc, false);
> +	move_acct_information(from, to, pc);

Kame-san, a question. According to is_target_pte_for_mc() it seems we
don't move file pages across cgroups for now. If !PageAnon(page) we just
return 0 and the page won't be selected for migration in
mem_cgroup_move_charge_pte_range().

So, if I've understood well the code is correct in perspective, but
right now it's unnecessary. File pages are not moved on task migration
across cgroups and, at the moment, there's no way for file page
accounted statistics to go negative.

Or am I missing something?

Thanks,
-Andrea

>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
>  		mem_cgroup_cancel_charge(from);
> @@ -1810,6 +1895,8 @@ static void __mem_cgroup_move_account(st
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 22:03               ` Andrea Righi
  (?)
@ 2010-03-03 23:25               ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03 23:25 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Balbir Singh

On Wed, 3 Mar 2010 23:03:19 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> > 
> > > Agreed.
> > > Let's try how we can write a code in clean way. (we have time ;)
> > > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > > over killing. What I really want is lockless code...but it seems impossible
> > > under current implementation.
> > > 
> > > I wonder the fact "the page is never unchareged under us" can give us some chances
> > > ...Hmm.
> > > 
> > 
> > How about this ? Basically, I don't like duplicating information...so,
> > # of new pcg_flags may be able to be reduced.
> > 
> > I'm glad this can be a hint for Andrea-san.
> > 
> > ==
> > ---
> >  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
> >  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 132 insertions(+), 3 deletions(-)
> > 
> > Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > @@ -39,6 +39,11 @@ enum {
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> >  	PCG_ACCT_LRU, /* page has been accounted for */
> > +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> > +	PCG_ACCT_DIRTY,
> > +	PCG_ACCT_WB,
> > +	PCG_ACCT_WB_TEMP,
> > +	PCG_ACCT_UNSTABLE,
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  
> > +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +
> > +SETPCGFLAG(AcctWB, ACCT_WB);
> > +CLEARPCGFLAG(AcctWB, ACCT_WB);
> > +TESTPCGFLAG(AcctWB, ACCT_WB);
> > +
> > +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +
> > +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
> >  {
> >  	return page_zonenum(pc->page);
> >  }
> > -
> > +/*
> > + * lock_page_cgroup() should not be held under mapping->tree_lock
> > + */
> >  static inline void lock_page_cgroup(struct page_cgroup *pc)
> >  {
> >  	bit_spin_lock(PCG_LOCK, &pc->flags);
> > @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
> >  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> >  }
> >  
> > +/*
> > + * Lock order is
> > + * 	lock_page_cgroup()
> > + * 		lock_page_cgroup_migrate()
> > + * This lock is not be lock for charge/uncharge but for account moving.
> > + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> > + * the page is uncharged while we hold this.
> > + */
> > +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> > +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct page_cgroup;
> >  
> > Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> > +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> > @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> >  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> >  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_STAT_DIRTY,
> > +	MEM_CGROUP_STAT_WBACK,
> > +	MEM_CGROUP_STAT_WBACK_TEMP,
> > +	MEM_CGROUP_STAT_UNSTABLE_NFS,
> >  
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> > @@ -1360,6 +1364,86 @@ done:
> >  }
> >  
> >  /*
> > + * Update file cache's status for memcg. Before calling this,
> > + * mapping->tree_lock should be held and preemption is disabled.
> > + * Then, it's guarnteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> > + */
> > +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct mem_cgroup *mem;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	/* Not accounted ? */
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +	lock_page_cgroup_migrate(pc);
> > +	/*
> > +	 * It's guarnteed that this page is never uncharged.
> > +	 * The only racy problem is moving account among memcgs.
> > +	 */
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_DIRTY:
> > +		if (set)
> > +			SetPageCgroupAcctDirty(pc);
> > +		else
> > +			ClearPageCgroupAcctDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK:
> > +		if (set)
> > +			SetPageCgroupAcctWB(pc);
> > +		else
> > +			ClearPageCgroupAcctWB(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK_TEMP:
> > +		if (set)
> > +			SetPageCgroupAcctWBTemp(pc);
> > +		else
> > +			ClearPageCgroupAcctWBTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (set)
> > +			SetPageCgroupAcctUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupAcctUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (set)
> > +		__this_cpu_inc(mem->stat->count[idx]);
> > +	else
> > +		__this_cpu_dec(mem->stat->count[idx]);
> > +	unlock_page_cgroup_migrate(pc);
> > +}
> > +
> > +static void move_acct_information(struct mem_cgroup *from,
> > +				struct mem_cgroup *to,
> > +				struct page_cgroup *pc)
> > +{
> > +	/* preemption is disabled, migration_lock is held. */
> > +	if (PageCgroupAcctDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +	}
> > +	if (PageCgroupAcctWB(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +	}
> > +	if (PageCgroupAcctWBTemp(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +	}
> > +	if (PageCgroupAcctUnstableNFS(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> > +/*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> >   * TODO: maybe necessary to use big numbers in big irons.
> >   */
> > @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. If !PageAnon(page) we just
> return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
You're right. It's my TODO to move file pages at task migration.

> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 
__mem_cgroup_move_account() will be called not only at task migration
but also at rmdir, so I think it would be better to handle file pages anyway.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 22:03               ` Andrea Righi
@ 2010-03-03 23:25                 ` Daisuke Nishimura
  -1 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03 23:25 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh, Daisuke Nishimura

On Wed, 3 Mar 2010 23:03:19 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > Agreed.
> > > Let's try how we can write a code in clean way. (we have time ;)
> > > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > > over killing. What I really want is lockless code...but it seems impossible
> > > under current implementation.
> > > 
> > > I wonder the fact "the page is never unchareged under us" can give us some chances
> > > ...Hmm.
> > > 
> > 
> > How about this ? Basically, I don't like duplicating information...so,
> > # of new pcg_flags may be able to be reduced.
> > 
> > I'm glad this can be a hint for Andrea-san.
> > 
> > ==
> > ---
> >  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
> >  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 132 insertions(+), 3 deletions(-)
> > 
> > Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > @@ -39,6 +39,11 @@ enum {
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> >  	PCG_ACCT_LRU, /* page has been accounted for */
> > +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> > +	PCG_ACCT_DIRTY,
> > +	PCG_ACCT_WB,
> > +	PCG_ACCT_WB_TEMP,
> > +	PCG_ACCT_UNSTABLE,
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  
> > +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +
> > +SETPCGFLAG(AcctWB, ACCT_WB);
> > +CLEARPCGFLAG(AcctWB, ACCT_WB);
> > +TESTPCGFLAG(AcctWB, ACCT_WB);
> > +
> > +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +
> > +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
> >  {
> >  	return page_zonenum(pc->page);
> >  }
> > -
> > +/*
> > + * lock_page_cgroup() should not be held under mapping->tree_lock
> > + */
> >  static inline void lock_page_cgroup(struct page_cgroup *pc)
> >  {
> >  	bit_spin_lock(PCG_LOCK, &pc->flags);
> > @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
> >  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> >  }
> >  
> > +/*
> > + * Lock order is
> > + * 	lock_page_cgroup()
> > + * 		lock_page_cgroup_migrate()
> > + * This lock is not be lock for charge/uncharge but for account moving.
> > + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> > + * the page is uncharged while we hold this.
> > + */
> > +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> > +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct page_cgroup;
> >  
> > Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> > +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> > @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> >  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> >  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_STAT_DIRTY,
> > +	MEM_CGROUP_STAT_WBACK,
> > +	MEM_CGROUP_STAT_WBACK_TEMP,
> > +	MEM_CGROUP_STAT_UNSTABLE_NFS,
> >  
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> > @@ -1360,6 +1364,86 @@ done:
> >  }
> >  
> >  /*
> > + * Update file cache's status for memcg. Before calling this,
> > + * mapping->tree_lock should be held and preemption is disabled.
> > + * Then, it's guarnteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> > + */
> > +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct mem_cgroup *mem;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	/* Not accounted ? */
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +	lock_page_cgroup_migrate(pc);
> > +	/*
> > +	 * It's guarnteed that this page is never uncharged.
> > +	 * The only racy problem is moving account among memcgs.
> > +	 */
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_DIRTY:
> > +		if (set)
> > +			SetPageCgroupAcctDirty(pc);
> > +		else
> > +			ClearPageCgroupAcctDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK:
> > +		if (set)
> > +			SetPageCgroupAcctWB(pc);
> > +		else
> > +			ClearPageCgroupAcctWB(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK_TEMP:
> > +		if (set)
> > +			SetPageCgroupAcctWBTemp(pc);
> > +		else
> > +			ClearPageCgroupAcctWBTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (set)
> > +			SetPageCgroupAcctUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupAcctUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (set)
> > +		__this_cpu_inc(mem->stat->count[idx]);
> > +	else
> > +		__this_cpu_dec(mem->stat->count[idx]);
> > +	unlock_page_cgroup_migrate(pc);
> > +}
> > +
> > +static void move_acct_information(struct mem_cgroup *from,
> > +				struct mem_cgroup *to,
> > +				struct page_cgroup *pc)
> > +{
> > +	/* preemption is disabled, migration_lock is held. */
> > +	if (PageCgroupAcctDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +	}
> > +	if (PageCgroupAcctWB(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +	}
> > +	if (PageCgroupAcctWBTemp(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +	}
> > +	if (PageCgroupAcctUnstableNFS(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> > +/*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> >   * TODO: maybe necessary to use big numbers in big irons.
> >   */
> > @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. If !PageAnon(page) we just
> return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
You're right. It's my TODO to move file pages at task migration.

> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 
__mem_cgroup_move_account() will be called not only at task migration
but also at rmdir, so I think it would be better to handle file pages anyway.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-03 23:25                 ` Daisuke Nishimura
  0 siblings, 0 replies; 140+ messages in thread
From: Daisuke Nishimura @ 2010-03-03 23:25 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh, Daisuke Nishimura

On Wed, 3 Mar 2010 23:03:19 +0100, Andrea Righi <arighi@develer.com> wrote:
> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > Agreed.
> > > Let's try how we can write a code in clean way. (we have time ;)
> > > For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
> > > over killing. What I really want is lockless code...but it seems impossible
> > > under current implementation.
> > > 
> > > I wonder the fact "the page is never unchareged under us" can give us some chances
> > > ...Hmm.
> > > 
> > 
> > How about this ? Basically, I don't like duplicating information...so,
> > # of new pcg_flags may be able to be reduced.
> > 
> > I'm glad this can be a hint for Andrea-san.
> > 
> > ==
> > ---
> >  include/linux/page_cgroup.h |   44 ++++++++++++++++++++-
> >  mm/memcontrol.c             |   91 +++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 132 insertions(+), 3 deletions(-)
> > 
> > Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
> > +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
> > @@ -39,6 +39,11 @@ enum {
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> >  	PCG_ACCT_LRU, /* page has been accounted for */
> > +	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
> > +	PCG_ACCT_DIRTY,
> > +	PCG_ACCT_WB,
> > +	PCG_ACCT_WB_TEMP,
> > +	PCG_ACCT_UNSTABLE,
> >  };
> >  
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  
> > +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
> > +
> > +SETPCGFLAG(AcctWB, ACCT_WB);
> > +CLEARPCGFLAG(AcctWB, ACCT_WB);
> > +TESTPCGFLAG(AcctWB, ACCT_WB);
> > +
> > +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
> > +
> > +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
> > +
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
> >  {
> >  	return page_zonenum(pc->page);
> >  }
> > -
> > +/*
> > + * lock_page_cgroup() should not be held under mapping->tree_lock
> > + */
> >  static inline void lock_page_cgroup(struct page_cgroup *pc)
> >  {
> >  	bit_spin_lock(PCG_LOCK, &pc->flags);
> > @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
> >  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> >  }
> >  
> > +/*
> > + * Lock order is
> > + * 	lock_page_cgroup()
> > + * 		lock_page_cgroup_migrate()
> > + * This lock is not be lock for charge/uncharge but for account moving.
> > + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
> > + * the page is uncharged while we hold this.
> > + */
> > +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> > +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> > +{
> > +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> > +}
> > +
> >  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >  struct page_cgroup;
> >  
> > Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
> > +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
> > @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> >  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> >  	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_STAT_DIRTY,
> > +	MEM_CGROUP_STAT_WBACK,
> > +	MEM_CGROUP_STAT_WBACK_TEMP,
> > +	MEM_CGROUP_STAT_UNSTABLE_NFS,
> >  
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> > @@ -1360,6 +1364,86 @@ done:
> >  }
> >  
> >  /*
> > + * Update file cache's status for memcg. Before calling this,
> > + * mapping->tree_lock should be held and preemption is disabled.
> > + * Then, it's guarnteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> > + */
> > +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
> > +{
> > +	struct page_cgroup *pc;
> > +	struct mem_cgroup *mem;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	/* Not accounted ? */
> > +	if (!PageCgroupUsed(pc))
> > +		return;
> > +	lock_page_cgroup_migrate(pc);
> > +	/*
> > +	 * It's guarnteed that this page is never uncharged.
> > +	 * The only racy problem is moving account among memcgs.
> > +	 */
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_DIRTY:
> > +		if (set)
> > +			SetPageCgroupAcctDirty(pc);
> > +		else
> > +			ClearPageCgroupAcctDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK:
> > +		if (set)
> > +			SetPageCgroupAcctWB(pc);
> > +		else
> > +			ClearPageCgroupAcctWB(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WBACK_TEMP:
> > +		if (set)
> > +			SetPageCgroupAcctWBTemp(pc);
> > +		else
> > +			ClearPageCgroupAcctWBTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (set)
> > +			SetPageCgroupAcctUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupAcctUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (set)
> > +		__this_cpu_inc(mem->stat->count[idx]);
> > +	else
> > +		__this_cpu_dec(mem->stat->count[idx]);
> > +	unlock_page_cgroup_migrate(pc);
> > +}
> > +
> > +static void move_acct_information(struct mem_cgroup *from,
> > +				struct mem_cgroup *to,
> > +				struct page_cgroup *pc)
> > +{
> > +	/* preemption is disabled, migration_lock is held. */
> > +	if (PageCgroupAcctDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_DIRTY]);
> > +	}
> > +	if (PageCgroupAcctWB(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK]);
> > +	}
> > +	if (PageCgroupAcctWBTemp(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WBACK_TEMP]);
> > +	}
> > +	if (PageCgroupAcctUnstableNFS(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> > +/*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> >   * TODO: maybe necessary to use big numbers in big irons.
> >   */
> > @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. If !PageAnon(page) we just
> return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
You're right. It's my TODO to move file pages at task migration.

> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 
__mem_cgroup_move_account() will be called not only at task migration
but also at rmdir, so I think it would be better to handle file pages anyway.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 22:03               ` Andrea Righi
                                 ` (3 preceding siblings ...)
  (?)
@ 2010-03-04  3:45               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-04  3:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Greg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Singh

On Wed, 3 Mar 2010 23:03:19 +0100
Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:

> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
 
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. 

yes. It's just in plan.

> If !PageAnon(page) we just return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 

At rmdir(), remainging file caches in a cgroup is moved to
its parent. Then, all file caches are moved to its parent at rmdir().

This behavior is for avoiding to lose too much file caches at removing cgroup.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
  2010-03-03 22:03               ` Andrea Righi
@ 2010-03-04  3:45                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-04  3:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 23:03:19 +0100
Andrea Righi <arighi@develer.com> wrote:

> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. 

yes. It's just in plan.

> If !PageAnon(page) we just return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 

At rmdir(), remainging file caches in a cgroup is moved to
its parent. Then, all file caches are moved to its parent at rmdir().

This behavior is for avoiding to lose too much file caches at removing cgroup.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 140+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
@ 2010-03-04  3:45                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 140+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-04  3:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, containers, linux-kernel, linux-mm, Greg,
	Suleiman Souhlal, Andrew Morton, Balbir Singh

On Wed, 3 Mar 2010 23:03:19 +0100
Andrea Righi <arighi@develer.com> wrote:

> On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 3 Mar 2010 15:15:49 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> >  	page = pc->page;
> >  	if (page_mapped(page) && !PageAnon(page)) {
> >  		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> >  		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> >  		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> >  	}
> >  	mem_cgroup_charge_statistics(from, pc, false);
> > +	move_acct_information(from, to, pc);
> 
> Kame-san, a question. According to is_target_pte_for_mc() it seems we
> don't move file pages across cgroups for now. 

yes. It's just in plan.

> If !PageAnon(page) we just return 0 and the page won't be selected for migration in
> mem_cgroup_move_charge_pte_range().
> 
> So, if I've understood well the code is correct in perspective, but
> right now it's unnecessary. File pages are not moved on task migration
> across cgroups and, at the moment, there's no way for file page
> accounted statistics to go negative.
> 
> Or am I missing something?
> 

At rmdir(), remainging file caches in a cgroup is moved to
its parent. Then, all file caches are moved to its parent at rmdir().

This behavior is for avoiding to lose too much file caches at removing cgroup.

Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 140+ messages in thread

end of thread, other threads:[~2010-03-04  3:48 UTC | newest]

Thread overview: 140+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-01 21:23 [PATCH -mmotm 0/3] memcg: per cgroup dirty limit (v3) Andrea Righi
2010-03-01 21:23 ` Andrea Righi
2010-03-01 21:23 ` [PATCH -mmotm 1/3] memcg: dirty memory documentation Andrea Righi
2010-03-01 21:23   ` Andrea Righi
2010-03-01 21:23 ` [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
2010-03-01 21:23   ` Andrea Righi
2010-03-02  0:20   ` KAMEZAWA Hiroyuki
2010-03-02  0:20     ` KAMEZAWA Hiroyuki
2010-03-02 10:04   ` Kirill A. Shutemov
2010-03-02 10:04     ` Kirill A. Shutemov
2010-03-02 11:00     ` Andrea Righi
2010-03-02 11:00       ` Andrea Righi
     [not found]     ` <cc557aab1003020204k16038838ta537357aeeb67b11-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-03-02 11:00       ` Andrea Righi
2010-03-02 13:02   ` Balbir Singh
2010-03-02 13:02     ` Balbir Singh
2010-03-02 21:50     ` Andrea Righi
2010-03-02 21:50       ` Andrea Righi
     [not found]     ` <20100302130223.GF3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-02 21:50       ` Andrea Righi
2010-03-02 18:08   ` Greg Thelen
2010-03-02 18:08     ` Greg Thelen
2010-03-02 22:24     ` Andrea Righi
2010-03-02 22:24       ` Andrea Righi
     [not found]     ` <49b004811003021008t4fae71bbu8d56192e48c32f39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-03-02 22:24       ` Andrea Righi
     [not found]   ` <1267478620-5276-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-02  0:20     ` KAMEZAWA Hiroyuki
2010-03-02 10:04     ` Kirill A. Shutemov
2010-03-02 13:02     ` Balbir Singh
2010-03-02 18:08     ` Greg Thelen
     [not found] ` <1267478620-5276-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-01 21:23   ` [PATCH -mmotm 1/3] memcg: dirty memory documentation Andrea Righi
2010-03-01 21:23   ` [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
2010-03-01 21:23   ` [PATCH -mmotm 3/3] memcg: dirty pages instrumentation Andrea Righi
2010-03-01 21:23 ` Andrea Righi
2010-03-01 21:23   ` Andrea Righi
2010-03-01 22:02   ` Vivek Goyal
2010-03-01 22:02     ` Vivek Goyal
2010-03-01 22:18     ` Andrea Righi
2010-03-01 22:18       ` Andrea Righi
2010-03-02 15:05       ` Vivek Goyal
2010-03-02 15:05         ` Vivek Goyal
2010-03-02 22:22         ` Andrea Righi
2010-03-02 22:22           ` Andrea Righi
2010-03-02 23:59           ` Vivek Goyal
2010-03-02 23:59           ` Vivek Goyal
2010-03-02 23:59             ` Vivek Goyal
2010-03-03 11:47             ` Andrea Righi
2010-03-03 11:47               ` Andrea Righi
2010-03-03 11:56               ` Andrea Righi
2010-03-03 11:56                 ` Andrea Righi
2010-03-03 11:56               ` Andrea Righi
     [not found]             ` <20100302235932.GA3007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-03 11:47               ` Andrea Righi
     [not found]         ` <20100302150529.GA12855-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-02 22:22           ` Andrea Righi
2010-03-02 15:05       ` Vivek Goyal
     [not found]     ` <20100301220208.GH3109-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-01 22:18       ` Andrea Righi
2010-03-02  0:23   ` KAMEZAWA Hiroyuki
2010-03-02  0:23     ` KAMEZAWA Hiroyuki
2010-03-02  8:01     ` Andrea Righi
2010-03-02  8:01       ` Andrea Righi
2010-03-02  8:12       ` Daisuke Nishimura
2010-03-02  8:12       ` Daisuke Nishimura
2010-03-02  8:12         ` Daisuke Nishimura
2010-03-02  8:23       ` KAMEZAWA Hiroyuki
2010-03-02  8:23       ` KAMEZAWA Hiroyuki
2010-03-02  8:23         ` KAMEZAWA Hiroyuki
2010-03-02 13:50         ` Balbir Singh
2010-03-02 13:50           ` Balbir Singh
     [not found]           ` <20100302135026.GH3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-02 22:18             ` Andrea Righi
2010-03-02 22:18           ` Andrea Righi
2010-03-02 22:18             ` Andrea Righi
2010-03-02 23:21             ` Daisuke Nishimura
2010-03-02 23:21             ` Daisuke Nishimura
2010-03-02 23:21               ` Daisuke Nishimura
     [not found]               ` <20100303082107.a29562fa.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2010-03-03 11:48                 ` Andrea Righi
2010-03-03 11:48               ` Andrea Righi
2010-03-03 11:48                 ` Andrea Righi
     [not found]         ` <20100302172316.b959b04c.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-02 13:50           ` Balbir Singh
     [not found]     ` <20100302092309.bff454d7.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-02  8:01       ` Andrea Righi
2010-03-02 10:11   ` Kirill A. Shutemov
2010-03-02 10:11     ` Kirill A. Shutemov
     [not found]     ` <cc557aab1003020211h391947f0p3eae04a298127d32-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-03-02 11:02       ` Andrea Righi
2010-03-02 11:02     ` Andrea Righi
2010-03-02 11:02       ` Andrea Righi
2010-03-02 11:09       ` Kirill A. Shutemov
2010-03-02 11:09       ` Kirill A. Shutemov
2010-03-02 11:09         ` Kirill A. Shutemov
     [not found]         ` <cc557aab1003020309y37587110i685d0d968bfba9f4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-03-02 11:34           ` Andrea Righi
2010-03-02 11:34         ` Andrea Righi
2010-03-02 11:34           ` Andrea Righi
2010-03-02 13:47   ` Balbir Singh
2010-03-02 13:47     ` Balbir Singh
2010-03-02 13:56     ` Kirill A. Shutemov
2010-03-02 13:56       ` Kirill A. Shutemov
     [not found]     ` <20100302134736.GG3212-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-02 13:56       ` Kirill A. Shutemov
2010-03-02 13:48   ` Peter Zijlstra
2010-03-02 13:48     ` Peter Zijlstra
2010-03-02 15:26     ` Balbir Singh
2010-03-02 15:26     ` Balbir Singh
2010-03-02 15:26       ` Balbir Singh
2010-03-02 15:49     ` Trond Myklebust
2010-03-02 15:49     ` Trond Myklebust
2010-03-02 15:49       ` Trond Myklebust
2010-03-02 22:14     ` Andrea Righi
2010-03-02 22:14       ` Andrea Righi
2010-03-03 10:07       ` Peter Zijlstra
2010-03-03 10:07         ` Peter Zijlstra
2010-03-03 12:05         ` Andrea Righi
2010-03-03 12:05         ` Andrea Righi
2010-03-03 12:05           ` Andrea Righi
2010-03-03 10:07       ` Peter Zijlstra
2010-03-02 22:14     ` Andrea Righi
2010-03-03  2:12   ` Daisuke Nishimura
2010-03-03  2:12     ` Daisuke Nishimura
2010-03-03  3:29     ` KAMEZAWA Hiroyuki
2010-03-03  3:29       ` KAMEZAWA Hiroyuki
     [not found]       ` <20100303122906.9c613ab2.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-03  6:01         ` Daisuke Nishimura
2010-03-03  6:01       ` Daisuke Nishimura
2010-03-03  6:01         ` Daisuke Nishimura
2010-03-03  6:15         ` KAMEZAWA Hiroyuki
2010-03-03  6:15           ` KAMEZAWA Hiroyuki
2010-03-03  8:21           ` KAMEZAWA Hiroyuki
2010-03-03  8:21             ` KAMEZAWA Hiroyuki
2010-03-03 11:50             ` Andrea Righi
2010-03-03 11:50               ` Andrea Righi
2010-03-03 22:03             ` Andrea Righi
2010-03-03 22:03               ` Andrea Righi
2010-03-03 23:25               ` Daisuke Nishimura
2010-03-03 23:25               ` Daisuke Nishimura
2010-03-03 23:25                 ` Daisuke Nishimura
2010-03-04  3:45               ` KAMEZAWA Hiroyuki
2010-03-04  3:45                 ` KAMEZAWA Hiroyuki
2010-03-04  3:45               ` KAMEZAWA Hiroyuki
     [not found]             ` <20100303172132.fc6d9387.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-03 11:50               ` Andrea Righi
2010-03-03 22:03               ` Andrea Righi
     [not found]           ` <20100303151549.5d3d686a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-03  8:21             ` KAMEZAWA Hiroyuki
     [not found]         ` <20100303150137.f56d7084.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2010-03-03  6:15           ` KAMEZAWA Hiroyuki
     [not found]     ` <20100303111238.7133f8af.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2010-03-03  3:29       ` KAMEZAWA Hiroyuki
     [not found]   ` <1267478620-5276-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-01 22:02     ` Vivek Goyal
2010-03-02  0:23     ` KAMEZAWA Hiroyuki
2010-03-02 10:11     ` Kirill A. Shutemov
2010-03-02 13:47     ` Balbir Singh
2010-03-02 13:48     ` Peter Zijlstra
2010-03-03  2:12     ` Daisuke Nishimura

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.